# ESTIMATES OF LOCATION: 
## An estimate of location is a single value around which the data points tend to cluster.
## It gives a descriptive summary of numerical data.

## TASK :
### Kaggle Dataset: World Happiness Report (158 entries x 12 features)
#### 1) Find the median of each numerical feature of this dataset within each region.
#### 2) Calculate the happiness score trimmed mean of the world to reduce the influence of outliers.
#### 3) Calculate the average happiness score of each region, then their respective trimmed means.
#### 4) Compute the weighted mean and median of the two following factors: 
####      GDP per capita and life expectancy, both weighted by the family support.

In [1]:
!pip install wquantiles

Collecting wquantiles
  Downloading wquantiles-0.6-py3-none-any.whl.metadata (1.1 kB)
Downloading wquantiles-0.6-py3-none-any.whl (3.3 kB)
Installing collected packages: wquantiles
Successfully installed wquantiles-0.6


In [2]:
import numpy as np
import pandas as pd
import scipy.stats as sstats
import wquantiles

In [3]:
df = pd.read_csv('/kaggle/input/world-happiness/2015.csv')

In [4]:
df.head(15) # Show the first 15 entries of the dataset

,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176
5,Finland,Western Europe,6,7.406,0.0314,1.29025,1.31826,0.88911,0.64169,0.41372,0.23351,2.61955
6,Netherlands,Western Europe,7,7.378,0.02799,1.32944,1.28017,0.89284,0.61576,0.31814,0.4761,2.4657
7,Sweden,Western Europe,8,7.364,0.03157,1.33171,1.28907,0.91087,0.6598,0.43844,0.36262,2.37119
8,New Zealand,Australia and New Zealand,9,7.286,0.03371,1.25018,1.31967,0.90837,0.63938,0.42922,0.47501,2.26425
9,Australia,Australia and New Zealand,10,7.284,0.04083,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646


In [5]:
df.info() # get some info about the data, such as number of entries and features, etc.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        158 non-null    object 
 1   Region                         158 non-null    object 
 2   Happiness Rank                 158 non-null    int64  
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
memory usage: 1

In [6]:
df.describe() # a quick descriptive summary of the data

,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
count,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0
mean,79.493671,5.375734,0.047885,0.846137,0.991046,0.630259,0.428615,0.143422,0.237296,2.098977
std,45.754363,1.14501,0.017146,0.403121,0.272369,0.247078,0.150693,0.120034,0.126685,0.55355
min,1.0,2.839,0.01848,0.0,0.0,0.0,0.0,0.0,0.0,0.32858
25%,40.25,4.526,0.037268,0.545808,0.856823,0.439185,0.32833,0.061675,0.150553,1.75941
50%,79.5,5.2325,0.04394,0.910245,1.02951,0.696705,0.435515,0.10722,0.21613,2.095415
75%,118.75,6.24375,0.0523,1.158448,1.214405,0.811013,0.549092,0.180255,0.309883,2.462415
max,158.0,7.587,0.13693,1.69042,1.40223,1.02525,0.66973,0.55191,0.79588,3.60214


## Median : 
### It is a robust measure of central tendency: it resists the influence of outliers (extreme values).
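To make the robustness claim concrete, here is a minimal sketch that compares the mean and the median before and after adding one extreme value. The numbers are made up for illustration and are not taken from the World Happiness dataset.

```python
import numpy as np

# Hypothetical scores, made up for illustration (not taken from the dataset)
scores = np.array([5.1, 5.3, 5.4, 5.6, 5.8])
scores_with_outlier = np.append(scores, 50.0)  # add a single extreme value

mean_before = scores.mean()
mean_after = scores_with_outlier.mean()
median_before = np.median(scores)
median_after = np.median(scores_with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

In this toy example the single outlier pulls the mean far above every typical value, while the median only moves from 5.4 to 5.5.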

#### Question 1 : Find the median of each numerical feature of this dataset within each region, except the "Happiness Rank" feature

In [7]:
# Drop the non-numerical 'Country' column and the excluded 'Happiness Rank' column, then take the median per region
df.drop(columns=['Country', 'Happiness Rank']).groupby('Region').median()

Region,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
Australia and New Zealand,7.285,0.03727,1.29188,1.31445,0.919965,0.64531,0.392795,0.455315,2.265355
Central and Eastern Europe,5.286,0.04267,1.01216,1.10614,0.73128,0.35068,0.04212,0.15275,2.025
Eastern Asia,5.729,0.037245,1.257675,1.067175,0.92034,0.466205,0.07993,0.219665,1.772375
Latin America and Caribbean,6.149,0.052975,0.9094,1.14643,0.69606,0.51954,0.10826,0.21457,2.7092
Middle East and Northern Africa,5.262,0.044525,1.01722,1.00012,0.72109,0.347435,0.140405,0.16795,1.998595
North America,7.273,0.03696,1.3604,1.28486,0.88371,0.589505,0.244235,0.42958,2.480935
Southeastern Asia,5.36,0.0433,0.70532,1.02,0.63793,0.55664,0.10501,0.40359,1.86399
Southern Asia,4.565,0.03225,0.59543,0.43106,0.56874,0.39786,0.09719,0.33671,1.95637
Sub-Saharan Africa,4.272,0.047775,0.308445,0.878375,0.298155,0.38291,0.103875,0.207305,1.95005
Western Europe,6.937,0.03595,1.30232,1.28907,0.89667,0.61477,0.21843,0.29678,2.12367


## Trimmed mean : another robust measure of location
#### It removes a fixed percentage of the smallest and largest values before calculating the mean. 
#### This reduces the effect of outliers (extreme values) and skewed data, giving a more robust measure of central tendency
#### than the regular mean. It is widely used.

#### Question 2 : Calculate the happiness score trimmed mean of the world to reduce the influence of outliers.

#### 1st way : Calculate the trimmed mean manually using Pandas 

In [8]:
# Manual calculation (for understanding)
sorted_col = df['Happiness Score'].sort_values().reset_index(drop=True)

n = len(sorted_col)

trim_percent = 0.1  # 10%

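k = int(n * trim_percent)  # number of values to drop from each end

# drop 10% of data from the bottom and 10% of data from the top;
# unlike sorted_col[k:-k], this slice also works when k == 0
trimmed_col = sorted_col.iloc[k:n - k]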

trimmed_mean_1 = trimmed_col.mean()
trimmed_mean_1

5.3633984375

#### 2nd way : The efficient and concise way with SciPy
#### Calculate the trimmed mean automatically using the scipy.stats.trim_mean() function

In [9]:
# proportiontocut=0.1 drops 10% of the data from the bottom and 10% of the data from the top
trimmed_mean_2 = sstats.trim_mean(df['Happiness Score'], proportiontocut=0.1) # a 10% trim gives a central value that is less sensitive to outliers
# or, positionally: trimmed_mean_2 = sstats.trim_mean(df['Happiness Score'], 0.1)
trimmed_mean_2

5.3633984375

## Mean vs. Trimmed Mean: 
#### Question 3 : Calculate the average happiness score of each region, then their respective trimmed means.

#### Analysis Report: Based on the data at hand, the region with the highest average happiness score is "Australia and New Zealand" (7.285000),
#### followed by "North America" (7.273000). "Sub-Saharan Africa" has the lowest average, at 4.202800.
#### Note: For regions with fewer than 10 countries, the mean and the 10% trimmed mean are identical because int(0.1 * n) = 0, so nothing is trimmed.
#### In the larger regions the two values differ only slightly, which suggests that extreme values have little influence on the regional means.
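The following sketch illustrates how the group size affects a 10% trim. It uses made-up numbers rather than the dataset, and relies on `scipy.stats.trim_mean` cutting `int(proportiontocut * n)` values from each end.

```python
import numpy as np
from scipy import stats

# Hypothetical scores, made up for illustration (not taken from the dataset)
small_group = np.array([4.0, 5.0, 5.5, 6.0, 9.0])         # n = 5, so int(0.1 * 5) == 0
large_group = np.append(np.linspace(5.0, 6.0, 19), 60.0)  # n = 20, so int(0.1 * 20) == 2

small_mean = small_group.mean()
small_trimmed = stats.trim_mean(small_group, 0.1)  # nothing is cut from either end
large_mean = large_group.mean()
large_trimmed = stats.trim_mean(large_group, 0.1)  # 2 values are cut from each end

print("small group:", small_mean, small_trimmed)
print("large group:", large_mean, large_trimmed)
```

For the small group the two results coincide because no value is removed. For the large group the extreme value 60.0 is trimmed away, so the trimmed mean stays inside the range of the typical values.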

In [10]:
mean_vs_trimmed_mean = df.groupby('Region')['Happiness Score'].apply(lambda x: {
    'mean:': x.mean(),
    'trimmed_mean:': sstats.trim_mean(x, 0.1) # trim 10% from both ends
})
mean_vs_trimmed_mean

Region                                        
Australia and New Zealand        mean:            7.285000
                                 trimmed_mean:    7.285000
Central and Eastern Europe       mean:            5.332931
                                 trimmed_mean:    5.345280
Eastern Asia                     mean:            5.626167
                                 trimmed_mean:    5.626167
Latin America and Caribbean      mean:            6.144682
                                 trimmed_mean:    6.192444
Middle East and Northern Africa  mean:            5.406900
                                 trimmed_mean:    5.429750
North America                    mean:            7.273000
                                 trimmed_mean:    7.273000
Southeastern Asia                mean:            5.317444
                                 trimmed_mean:    5.317444
Southern Asia                    mean:            4.580857
                                 trimmed_mean:    4.580857
Sub-Sahar

## Weighted Mean : 
#### An average where each value is multiplied by its weight, and the sum is divided by the total weight.

#### Formula :

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
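As a quick check of the definition above, this sketch computes a weighted mean by hand and compares it with `np.average`. The values and weights are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

values = np.array([2.0, 4.0, 10.0])   # hypothetical values
weights = np.array([1.0, 1.0, 2.0])   # hypothetical weights

# Multiply each value by its weight, sum, and divide by the total weight
manual = (weights * values).sum() / weights.sum()

# NumPy performs the same computation when weights are supplied
with_numpy = np.average(values, weights=weights)

print(manual, with_numpy, values.mean())
```

Because the largest value carries the largest weight, the weighted mean (6.5) is higher than the unweighted mean of the same values.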


#### Question 4.a : Within each region, compute the weighted mean of the two following factors: 
####      GDP per capita and life expectancy, both weighted by the family support

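The groupby operation splits the DataFrame into groups based on 'Region' and, for each group, selects the 'Economy (GDP per Capita)' column. Each 'x' in the lambda is therefore a Series of GDP values for one region, and df.loc[x.index, 'Family'] picks the 'Family' values of the same rows to use as weights: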

In [11]:
# To understand how the groupby() method works with one column:
df.groupby('Region')['Economy (GDP per Capita)'].apply(lambda x: np.average(x, weights=df.loc[x.index, 'Family']))  

Region
Australia and New Zealand          1.291714
Central and Eastern Europe         0.959169
Eastern Asia                       1.146862
Latin America and Caribbean        0.892136
Middle East and Northern Africa    1.121820
North America                      1.359398
Southeastern Asia                  0.839238
Southern Asia                      0.614058
Sub-Saharan Africa                 0.417563
Western Europe                     1.303566
Name: Economy (GDP per Capita), dtype: float64

In [12]:
# First approach: 
weighted_means_1 = df.groupby('Region')[['Economy (GDP per Capita)', 'Health (Life Expectancy)']].apply(
    lambda x: np.average(x, axis=0, weights=df.loc[x.index, 'Family'])  # use group-specific weights
)
weighted_means_1

Region
Australia and New Zealand           [1.2917143991783637, 0.9199189534406025]
Central and Eastern Europe          [0.9591694910707005, 0.7188642314993371]
Eastern Asia                        [1.1468620353638868, 0.8704043014540911]
Latin America and Caribbean         [0.8921362601439689, 0.7104272366188898]
Middle East and Northern Africa     [1.1218197273897597, 0.7168929099718628]
North America                       [1.3593978266114595, 0.8843540234733746]
Southeastern Asia                   [0.8392384462105738, 0.6925395750696167]
Southern Asia                       [0.6140584902318889, 0.5773442482041066]
Sub-Saharan Africa                 [0.41756342701941085, 0.2832187103394575]
Western Europe                      [1.3035661360368702, 0.9092229618368637]
Name: (Economy (GDP per Capita), Health (Life Expectancy)), dtype: object

In [13]:
# Second approach: clear and concise
weighted_means_2 = df.groupby('Region')[['Economy (GDP per Capita)', 'Health (Life Expectancy)']].apply(
    lambda x: pd.Series({
        col: np.average(x[col], weights=df.loc[x.index, 'Family']) for col in x.columns # dict comprehension
    })
)
weighted_means_2

Region,Economy (GDP per Capita),Health (Life Expectancy)
Australia and New Zealand,1.291714,0.919919
Central and Eastern Europe,0.959169,0.718864
Eastern Asia,1.146862,0.870404
Latin America and Caribbean,0.892136,0.710427
Middle East and Northern Africa,1.12182,0.716893
North America,1.359398,0.884354
Southeastern Asia,0.839238,0.69254
Southern Asia,0.614058,0.577344
Sub-Saharan Africa,0.417563,0.283219
Western Europe,1.303566,0.909223


In [14]:
# Third approach:
# Calculate weighted mean for multiple columns by region
weighted_means_3 = df.groupby('Region')[['Economy (GDP per Capita)', 'Health (Life Expectancy)']].apply(
    lambda x: pd.Series({
        'Weighted GDP': np.average(x['Economy (GDP per Capita)'], weights=df.loc[x.index, 'Family']),
        'Weighted Health': np.average(x['Health (Life Expectancy)'], weights=df.loc[x.index, 'Family'])
    })
)
weighted_means_3

Region,Weighted GDP,Weighted Health
Australia and New Zealand,1.291714,0.919919
Central and Eastern Europe,0.959169,0.718864
Eastern Asia,1.146862,0.870404
Latin America and Caribbean,0.892136,0.710427
Middle East and Northern Africa,1.12182,0.716893
North America,1.359398,0.884354
Southeastern Asia,0.839238,0.69254
Southern Asia,0.614058,0.577344
Sub-Saharan Africa,0.417563,0.283219
Western Europe,1.303566,0.909223


## Weighted Median : 
#### The value separating the higher half from the lower half of a dataset, where each data point contributes proportionally to its weight.

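#### Formula :
#### Sort the values so that $x_1 \le x_2 \le \dots \le x_n$, with weights $w_1, \dots, w_n$. The weighted median is the value $x_k$ that satisfies

$$\sum_{i<k} w_i \le \frac{1}{2}\sum_{i=1}^{n} w_i \quad \text{and} \quad \sum_{i>k} w_i \le \frac{1}{2}\sum_{i=1}^{n} w_i$$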
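The idea can be sketched in plain NumPy: sort the values, accumulate the weights, and return the first value whose cumulative weight reaches half of the total. The helper name `weighted_median` below is hypothetical, and this is only one common convention (the "lower" weighted median). The `wquantiles` package appears to interpolate between neighbouring values instead, which would explain why the two-country regions in the output below show a weighted median equal to their weighted mean, so its results can differ from this sketch.

```python
import numpy as np

def weighted_median(values, weights):
    """Return the smallest value whose cumulative weight reaches half of the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    order = np.argsort(values)            # sort values, keeping weights aligned
    sorted_values = values[order]
    cumulative = np.cumsum(weights[order])

    cutoff = 0.5 * cumulative[-1]         # half of the total weight
    idx = np.searchsorted(cumulative, cutoff)  # first index with cumulative >= cutoff
    return sorted_values[idx]

# Hypothetical data: the heavy weight on 4 pulls the weighted median up to 4
heavy_top = weighted_median([1, 2, 3, 4], [1, 1, 1, 5])
# With equal weights this convention returns the lower of the two middle values
equal_weights = weighted_median([1, 2, 3, 4], [1, 1, 1, 1])
print(heavy_top, equal_weights)
```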

#### Question 4.b : Compute the weighted median of the two following factors: 
####      GDP per capita and life expectancy, both weighted by the family support

In [15]:
weighted_medians = df.groupby('Region')[['Economy (GDP per Capita)', 'Health (Life Expectancy)']].apply(
    lambda x: pd.Series({
        col: wquantiles.median(x[col], weights=df.loc[x.index, 'Family']) for col in x.columns
    })
)
weighted_medians

Region,Economy (GDP per Capita),Health (Life Expectancy)
Australia and New Zealand,1.291714,0.919919
Central and Eastern Europe,1.026008,0.733406
Eastern Asia,1.258108,0.920727
Latin America and Caribbean,0.919961,0.696613
Middle East and Northern Africa,1.085762,0.722934
North America,1.359398,0.884354
Southeastern Asia,0.789652,0.684096
Southern Asia,0.654318,0.572681
Sub-Saharan Africa,0.358103,0.300117
Western Europe,1.305127,0.896496


# Some Key Takeaways :

## Mode:

### This is the value that occurs most frequently in a dataset. It is the only measure of location that applies to nominal (unordered) categorical data.
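A minimal sketch of finding the mode of a categorical column with pandas. The region labels below are a short made-up sample, not the real column from the dataset.

```python
import pandas as pd

# Hypothetical sample of region labels (not the real 'Region' column)
regions = pd.Series([
    "Western Europe", "Sub-Saharan Africa", "Western Europe",
    "Eastern Asia", "Sub-Saharan Africa", "Western Europe",
])

most_common = regions.mode()      # returns a Series because ties are possible
counts = regions.value_counts()   # frequency of each category, most frequent first

print(most_common.tolist())
print(counts)
```

`mode()` returns every value that shares the highest count, so it is worth checking its length before treating the result as a single value.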

### Trimmed mean is useful when there are outliers, so it suits numerical features (columns) that may contain extreme values. 
### Weighted mean and median require a weight feature (column), so the dataset needs a variable that can serve as weights, such as frequency, quantity, or importance.

#### E.g., some real-world data with natural weights :
#### - economic data might have income values with population weights, 
#### - sales data could have prices weighted by quantities sold, 
#### - survey data where responses are weighted by demographics.

### Pandas DataFrame:
#### - the groupby() method splits the DataFrame into groups, one per unique value of the chosen categorical feature (column)
#### - the apply() method calls a function on each group and combines the results; the function may compute one or more aggregates.
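The split-apply-combine pattern can be sketched on a toy DataFrame. The column names and values below are hypothetical. Built-in aggregates go through `agg()`, while a custom function that needs several columns of the group (here a weighted mean) goes through `apply()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the World Happiness dataset)
toy = pd.DataFrame({
    "Region": ["A", "A", "B", "B", "B"],
    "Score":  [7.0, 4.0, 4.0, 6.0, 8.0],
    "Weight": [1.0, 3.0, 1.0, 1.0, 2.0],
})

# Built-in aggregates: one row per region, one column per function
summary = toy.groupby("Region")["Score"].agg(["mean", "median"])

# Custom aggregate: each group g is a small DataFrame holding both columns,
# so the weights can be read from the group itself
weighted = toy.groupby("Region")[["Score", "Weight"]].apply(
    lambda g: np.average(g["Score"], weights=g["Weight"])
)

print(summary)
print(weighted)
```

Selecting the weight column together with the value column is an alternative to looking the weights up with df.loc[x.index, ...]; both give each group its own weights.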