🧠 ANOVA Salary Comparison across Four Companies


Imagine four tech companies: Google, Microsoft, Amazon, and Tesla.

I have collected salary samples (in thousands) for a similar role at each of these companies.
I want to determine whether the average salaries differ significantly
between the companies, using one-way ANOVA.

Hypotheses:
# H0: The mean salaries of all four companies are equal.
# H1: At least one company’s mean salary differs from the others.

In [7]:
import pandas as pd
import numpy as np

# Create my dataset
data = {
    'Google': [25, 35, 40, 28, 36, 29],
    'Microsoft': [45, 39, 29, 56, 40],
    'Amazon': [54, 35, 33, 37, 27],
    'Tesla': [51, 62, 73, 69, 70]
}

In [8]:
# Wrap each list in a Series so that columns of different lengths are padded with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})
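
The padding behaviour of the cell above can be checked in isolation. This is a minimal sketch that rebuilds the same data; the variable names `n_rows`, `n_cols`, and `missing` are introduced here for illustration only.

```python
import pandas as pd

# Same sample data as in the notebook: Google has six values, the others five.
data = {
    'Google': [25, 35, 40, 28, 36, 29],
    'Microsoft': [45, 39, 29, 56, 40],
    'Amazon': [54, 35, 33, 37, 27],
    'Tesla': [51, 62, 73, 69, 70],
}

# Each Series gets a default 0..n-1 index. The DataFrame aligns the columns
# on the union of those indexes and fills the gaps with NaN.
df = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

n_rows, n_cols = df.shape
missing = df.isna().sum().to_dict()  # NaN count per company
print(n_rows, n_cols)
print(missing)
```

Passing the raw lists directly to `pd.DataFrame(data)` would raise an error because the lists are not all the same length, which is why each one is wrapped in a Series first.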

In [23]:
print(" The Salary Data (in thousands)")

df

 The Salary Data (in thousands)


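   Google  Microsoft  Amazon  Tesla
0      25       45.0    54.0   51.0
1      35       39.0    35.0   62.0
2      40       29.0    33.0   73.0
3      28       56.0    37.0   69.0
4      36       40.0    27.0   70.0
5      29       41.8    37.2   65.0

Note: this cell (In [23]) was run after the mean-imputation step in In [19], so row 5 shows the column means (41.8, 37.2, 65.0) that replaced the missing values for Microsoft, Amazon, and Tesla.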


In [12]:
from scipy import stats

# Drop NaN values, because the groups have different lengths

fvalue, pvalue = stats.f_oneway(
    df['Google'].dropna(),
    df['Microsoft'].dropna(),
    df['Amazon'].dropna(),
    df['Tesla'].dropna()
)

print(f"F-value: {fvalue:.4f}, P-value: {pvalue:.6f}")

F-value: 14.7194, P-value: 0.000056
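
To see where this F-value comes from, the statistic can be recomputed by hand as the ratio of the between-group mean square to the within-group mean square. The following is a sketch using only NumPy; the variable names are my own and are not part of the original notebook.

```python
import numpy as np

# The four salary samples, without any imputed values.
groups = [
    np.array([25, 35, 40, 28, 36, 29], dtype=float),  # Google
    np.array([45, 39, 29, 56, 40], dtype=float),      # Microsoft
    np.array([54, 35, 33, 37, 27], dtype=float),      # Amazon
    np.array([51, 62, 73, 69, 70], dtype=float),      # Tesla
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # between-group mean square
ms_within = ss_within / (n_total - k)    # within-group mean square
f_manual = ms_between / ms_within

print(f"F = {f_manual:.4f} with ({k - 1}, {n_total - k}) degrees of freedom")
```

The result should agree with the value reported by `stats.f_oneway` above, with 3 and 17 degrees of freedom.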


# 🧩 Interpretation:
# If the p-value < 0.05 → Reject H0 → At least one company’s mean salary is different.
# If the p-value ≥ 0.05 → Fail to reject H0 → The data show no significant difference between the means.

Since the p-value (0.000056) is less than α = 0.05,
we reject the null hypothesis and conclude that there is a statistically significant difference in mean salaries among the four companies.
ANOVA does not say which companies differ, so a follow-up Tukey HSD test is used to identify the specific pairs whose mean salaries differ.
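
The decision rule above can also be written out explicitly. This sketch assumes the F-value printed earlier and the degrees of freedom implied by 4 groups and 21 observations (3 and 17). The names `f_critical` and `reject_h0` are illustrative.

```python
from scipy import stats

alpha = 0.05
f_value = 14.7194                # F statistic reported by f_oneway above
df_between, df_within = 3, 17    # k - 1 and N - k for 4 groups, 21 observations

# p-value: probability of an F at least this large if H0 were true.
p_value = stats.f.sf(f_value, df_between, df_within)

# Equivalent view: compare F with the critical value at the chosen alpha.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_h0 = p_value < alpha
print(f"p = {p_value:.6f}, critical F = {f_critical:.4f}, reject H0: {reject_h0}")
```

Comparing the p-value with α and comparing F with the critical value are two forms of the same test, so they always lead to the same decision.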

In [17]:
df.isna().sum()  # count missing values per column; NaNs would cause problems in the Tukey test

Google       0
Microsoft    1
Amazon       1
Tesla        1


In [19]:
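df = df.apply(lambda col: col.fillna(col.mean()), axis=0)  # fill each column's NaN with that column's mean

# The column mean is used as a simple stand-in for each missing salary, so that every company has six values for the Tukey Honestly Significant Difference (HSD) test.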
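
Mean imputation leaves each group mean unchanged, but it adds an observation that was never collected, which raises the sample size and the error degrees of freedom. One possible alternative, sketched here and not what this notebook does, is to reshape the data to long format and drop the missing rows before running the test. statsmodels accepts groups of unequal size. The name `long_df` is introduced for this sketch, and its results can differ slightly from the table produced with the imputed data.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Original samples, rebuilt without imputation.
data = {
    'Google': [25, 35, 40, 28, 36, 29],
    'Microsoft': [45, 39, 29, 56, 40],
    'Amazon': [54, 35, 33, 37, 27],
    'Tesla': [51, 62, 73, 69, 70],
}
df_raw = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

# Long format: one row per (Company, Salary) pair, with padded NaN rows removed.
long_df = df_raw.melt(var_name='Company', value_name='Salary').dropna()

tukey_raw = pairwise_tukeyhsd(
    endog=long_df['Salary'],
    groups=long_df['Company'],
    alpha=0.05,
)
print(tukey_raw)
```

With this approach the Tukey test uses the same 21 observations as the ANOVA, so the two analyses are based on identical data.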

In [21]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df_melt = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['Google', 'Microsoft', 'Amazon', 'Tesla'])
df_melt.columns = ['index', 'Company', 'Salary']
tukey = pairwise_tukeyhsd(endog=df_melt['Salary'], groups=df_melt['Company'], alpha=0.05)
print(tukey)

    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
  group1    group2  meandiff p-adj   lower    upper  reject
-----------------------------------------------------------
   Amazon    Google  -5.0333 0.6967 -17.9025  7.8359  False
   Amazon Microsoft      4.6 0.7509  -8.2692 17.4692  False
   Amazon     Tesla     27.8    0.0  14.9308 40.6692   True
   Google Microsoft   9.6333 0.1887  -3.2359 22.5025  False
   Google     Tesla  32.8333    0.0  19.9641 45.7025   True
Microsoft     Tesla     23.2 0.0003  10.3308 36.0692   True
-----------------------------------------------------------


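ANOVA confirmed that at least one company’s mean salary differs.
Tukey HSD shows that Tesla’s mean salary is significantly higher than that of each of the other three companies (adjusted p ≤ 0.0003), while Google, Microsoft, and Amazon do not differ significantly from one another.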

In [22]:
print(df.mean())  # Tesla has the highest mean salary

Google       32.166667
Microsoft    41.800000
Amazon       37.200000
Tesla        65.000000
dtype: float64
