In [1]:
import pandas as pd
import warnings
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 10000)

# SO Data

Our goal is to compare the salaries of bachelor's-degree holders with those of master's-degree holders in the Stack Overflow survey.
```
df['group1'] = np.where(df['FormalEducation'].str.contains('Master', case=False, na=False), 1, 0)

df['group2'] = np.where(df['FormalEducation'].str.contains('Bachelor', case=False, na=False), 1, 0)
```
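As a quick illustration of how these two flags behave, here is a minimal sketch on a toy frame. The `FormalEducation` strings below are illustrative labels rather than rows taken from the survey file, and `toy` is a hypothetical name. Note that `na=False` makes a missing answer count as "not in the group" instead of propagating NaN.

```python
import numpy as np
import pandas as pd

# Toy answers (illustrative labels, not rows from the real data).
toy = pd.DataFrame({
    "FormalEducation": [
        "Master’s degree (MA, MS, M.Eng., MBA, etc.)",
        "Bachelor’s degree (BA, BS, B.Eng., etc.)",
        None,  # respondent skipped the question
        "Some college/university study without earning a degree",
    ]
})

# Same flag logic as in the definition above.
toy["group1"] = np.where(
    toy["FormalEducation"].str.contains("Master", case=False, na=False), 1, 0
)
toy["group2"] = np.where(
    toy["FormalEducation"].str.contains("Bachelor", case=False, na=False), 1, 0
)

print(toy[["group1", "group2"]])
```

Rows that match neither pattern end up with both flags equal to 0, so they drop out of every group mean computed later.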

In [2]:
df = pd.read_csv("outputs/so/clean_data.csv")
avg_group1 = np.mean(df.loc[df['group1']==1]['ConvertedCompYearly'])
avg_group2 = np.mean(df.loc[df['group2']==1]['ConvertedCompYearly'])
print(f"mean salary group1 = master degree is: {avg_group1}")
print(f"mean salary group2 = bachelor degree is: {avg_group2}")
print(f"salary mean diff is: {avg_group1-avg_group2}")

mean salary group1 = master degree is: 539667.8574363188
mean salary group2 = bachelor degree is: 635203.6378091428
salary mean diff is: -95535.78037282405


The negative difference means that, in the survey, master's-degree holders reported lower salaries on average than bachelor's-degree holders.

Now we want to check whether there are subpopulations with the opposite relation (a positive difference).

Let's look at all the subpopulations that have a support of at least 0.1 and are defined by the
attributes = ['Gender', 'Hobby', 'Student', 'LastNewJob', 'Exercise']
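To make the two quantities concrete, here is a minimal sketch of what support and the group difference mean for a single itemset. It is not the DivExplorer API: `subgroup_stats` and the toy values are hypothetical, and it assumes support is simply the fraction of rows that match the itemset.

```python
import pandas as pd

# Toy data (hypothetical values, not from the survey).
toy = pd.DataFrame({
    "Hobby":   ["Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
    "Student": ["No", "No", "No", "Yes", "No", "No", "No", "Yes"],
    "group1":  [1, 0, 1, 0, 0, 1, 0, 0],
    "group2":  [0, 1, 0, 1, 1, 0, 1, 1],
    "ConvertedCompYearly": [90.0, 60.0, 50.0, 40.0, 70.0, 110.0, 80.0, 30.0],
})

def subgroup_stats(data, itemset, outcome):
    """Return (support, mean(group1) - mean(group2)) for one itemset."""
    mask = pd.Series(True, index=data.index)
    for col, val in itemset.items():
        mask &= data[col] == val          # keep rows matching every item
    sub = data[mask]
    support = len(sub) / len(data)        # assumed definition of support
    mean1 = sub.loc[sub["group1"] == 1, outcome].mean()
    mean2 = sub.loc[sub["group2"] == 1, outcome].mean()
    return support, mean1 - mean2

support, div = subgroup_stats(
    toy, {"Hobby": "Yes", "Student": "No"}, "ConvertedCompYearly"
)
print(f"support = {support}, diff = {div}")
```

An itemset would pass the filter described above when `support >= 0.1`, and it is "interesting" for us when the sign of `div` is the opposite of the sign on the full data.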


We will run the DivExplorer code:

In [3]:
df_step1 = pd.read_csv("outputs/so/interesting_subpopulations.csv")
df_step1

Unnamed: 0,support,itemset,GROUP_1,GROUP_2,ConvertedCompYearly,ConvertedCompYearly_group1,ConvertedCompYearly_group2,ConvertedCompYearly_div,length,support_count
0,0.103109,"frozenset({'Exercise=3 - 4 times per week', 'Hobby=Yes', 'Student=No'})",2066.0,4040.0,858202.5,1052558.0,758811.8,293746.071203,3,6106.0
1,0.104814,"frozenset({'Gender=Male', 'Hobby=Yes', 'Exercise=3 - 4 times per week'})",2044.0,4163.0,876601.2,1058275.0,787400.6,270874.92024,3,6207.0
2,0.149631,"frozenset({'Student=Yes, full-time'})",2460.0,6401.0,284011.8,434767.5,226074.1,208693.353272,1,8861.0
3,0.110792,"frozenset({'Gender=Male', 'Student=No', 'Exercise=3 - 4 times per week'})",2186.0,4375.0,977119.0,1109402.0,911022.9,198379.089827,3,6561.0
4,0.124284,"frozenset({'Exercise=3 - 4 times per week', 'Hobby=Yes'})",2439.0,4921.0,811657.3,936044.3,750007.3,186036.989471,2,7360.0
5,0.130448,"frozenset({'Gender=Male', 'Exercise=3 - 4 times per week'})",2539.0,5186.0,932858.3,1041341.0,879746.6,161594.378901,2,7725.0
6,0.122106,"frozenset({'Hobby=Yes', 'Student=Yes, full-time'})",1983.0,5248.0,271299.3,375147.0,232059.6,143087.390771,2,7231.0
7,0.132018,"frozenset({'Exercise=3 - 4 times per week', 'Student=No'})",2638.0,5180.0,937590.8,989386.7,911212.9,78173.859842,2,7818.0
8,0.225468,"frozenset({'LastNewJob=Less than a year ago', 'Hobby=Yes'})",4007.0,9345.0,714024.0,763800.6,692680.5,71120.031678,2,13352.0
9,0.163022,"frozenset({'Gender=Male', 'LastNewJob=Less than a year ago', 'Hobby=Yes'})",2915.0,6739.0,868177.9,914000.6,848357.0,65643.512306,3,9654.0


We can see that in the first 16 rows there are subpopulations with the opposite relation!

The master's group (ConvertedCompYearly_group1 column) earned higher salaries than the bachelor's group (ConvertedCompYearly_group2 column).

For example, in the first row the master's-degree holders in the subpopulation frozenset({'Exercise=3 - 4 times per week', 'Hobby=Yes', 'Student=No'}) averaged 1,052,558 while the bachelor's-degree holders averaged 758,812, a difference of 293,746!

Now we want to find the reason. Let's run the second step with the possible

attributes = ['YearsCodingProf', 'Country','Age', 'RaceEthnicity', 'JobSatisfaction', 'HopeFiveYears', 'WakeTime', 'DevType']

In [4]:
df_step2 = pd.read_csv("outputs/so/subpopulations_and_treatments.csv")
df_step2

Unnamed: 0,itemset,treatment
0,"frozenset({'Exercise=3 - 4 times per week', 'Hobby=Yes', 'Student=No'})",Country_United States+RaceEthnicity_WhiteorofEuropeandescent_1
1,"frozenset({'Gender=Male', 'Hobby=Yes', 'Exercise=3 - 4 times per week'})",Country_United States
2,"frozenset({'Student=Yes, full-time'})",Country_United States+Age_25 - 34 years old
3,"frozenset({'Gender=Male', 'Student=No', 'Exercise=3 - 4 times per week'})",Country_United States
4,"frozenset({'Exercise=3 - 4 times per week', 'Hobby=Yes'})",Country_United States+RaceEthnicity_WhiteorofEuropeandescent_1
5,"frozenset({'Gender=Male', 'Exercise=3 - 4 times per week'})",Country_United States
6,"frozenset({'Hobby=Yes', 'Student=Yes, full-time'})",Country_United States+Age_25 - 34 years old
7,"frozenset({'Exercise=3 - 4 times per week', 'Student=No'})",Country_United States+RaceEthnicity_WhiteorofEuropeandescent_1
8,"frozenset({'LastNewJob=Less than a year ago', 'Hobby=Yes'})",Age_55 - 64 years old
9,"frozenset({'Gender=Male', 'LastNewJob=Less than a year ago', 'Hobby=Yes'})",Age_55 - 64 years old


We can see that different itemsets get different treatments.

For the itemset mentioned above (also the first row here), the explanation is: Country=United States & RaceEthnicity=WhiteorofEuropeandescent

Let's check whether it aligns with the data:

In [5]:
df_subpopulation = df.loc[(df["Exercise"]=="3 - 4 times per week") & (df["Hobby"]=="Yes") & (df["Student"]=="No")]
avg_group1 = np.mean(df_subpopulation.loc[df_subpopulation['group1']==1]['ConvertedCompYearly'])
avg_group2 = np.mean(df_subpopulation.loc[df_subpopulation['group2']==1]['ConvertedCompYearly'])
print(f"mean salary group1 = master degree is: {avg_group1}")
print(f"mean salary group2 = bachelor degree is: {avg_group2}")
print(f"salary mean diff is: {avg_group1-avg_group2}")

mean salary group1 = master degree is: 1052557.916747338
mean salary group2 = bachelor degree is: 758811.8455445544
salary mean diff is: 293746.0712027835


In [6]:
df_treated = df_subpopulation.loc[(df_subpopulation["Country"]=="United States") & (df_subpopulation["RaceEthnicity_WhiteorofEuropeandescent"]==1.0)]
avg_group1 = np.mean(df_treated.loc[df_treated['group1']==1]['ConvertedCompYearly'])
avg_group2 = np.mean(df_treated.loc[df_treated['group2']==1]['ConvertedCompYearly'])
print(f"mean salary group1 = master degree is: {avg_group1}")
print(f"mean salary group2 = bachelor degree is: {avg_group2}")
print(f"salary mean diff is: {avg_group1-avg_group2}")

mean salary group1 = master degree is: 2472876.4872521246
mean salary group2 = bachelor degree is: 1154974.7530864198
salary mean diff is: 1317901.7341657048


The actual results show that under this treatment the master's group's mean salary rose far more (from about 1.05M to 2.47M) than the bachelor's group's (from about 0.76M to 1.15M), so the difference grew from 293,746 to 1,317,902!
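The comparison can be reproduced directly from the numbers printed in cells In [5] and In [6]. This is only a worked arithmetic check; the dictionary names are hypothetical.

```python
# Means copied from the printed outputs of cells In [5] and In [6].
before = {"master": 1052557.916747338, "bachelor": 758811.8455445544}
after = {"master": 2472876.4872521246, "bachelor": 1154974.7530864198}

# How much each group's mean grew once the treatment filter was applied.
ratios = {g: after[g] / before[g] for g in before}

# Gap between the groups before and after the treatment filter.
gap_before = before["master"] - before["bachelor"]
gap_after = after["master"] - after["bachelor"]

for g, r in ratios.items():
    print(f"{g}: mean grew by a factor of {r:.2f}")
print(f"gap before: {gap_before:,.0f}, gap after: {gap_after:,.0f}")
```

Both means grow, but the master's mean grows by a larger factor, which is why the gap widens instead of closing.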

For the final results, let's pick the top K=5 facts from all the data:

In [7]:
df_step3 = pd.read_csv("outputs/so/find_k/5_0.5_9e-05.csv")
df_step3

Unnamed: 0,itemset,treatment,ate1,ate2,iscore,size_itemset,size_group1,size_group2,support,ni_score,utility,std,diff_means
0,"{'Hobby': 'Yes', 'Exercise': '3 - 4 times per week', 'Student': 'No'}","[{'att': 'Country', 'value': <function parse_treatment.<locals>.<lambda> at 0x00000198B5E98940>, 'val_specified': 'United States'}, {'att': 'RaceEthnicity_WhiteorofEuropeandescent', 'value': <function parse_treatment.<locals>.<lambda> at 0x00000198B5E988B0>, 'val_specified': 1.0}]",1713005.0,583271.9,1129733.0,6106,2066,4040,0.103109,1.0,1.0,6780992.0,293746.071203
1,"{'Student': 'Yes, full-time'}","[{'att': 'Country', 'value': <function parse_treatment.<locals>.<lambda> at 0x00000198B5DE21F0>, 'val_specified': 'United States'}, {'att': 'Age', 'value': <function parse_treatment.<locals>.<lambda> at 0x00000198B5E98700>, 'val_specified': '25 - 34 years old'}]",3224831.0,1265303.0,1959529.0,8861,2460,6401,0.149631,1.0,1.0,3557639.0,208693.353272
2,"{'Hobby': 'Yes', 'Exercise': ""I don't typically exercise"", 'Student': 'No'}","[{'att': 'RaceEthnicity_WhiteorofEuropeandescent', 'value': <function parse_treatment.<locals>.<lambda> at 0x00000198B5C9D8B0>, 'val_specified': 1.0}]",238189.9,567735.6,329545.7,9938,3240,6698,0.167818,1.0,1.0,5533553.0,-222233.98509
3,"{'Hobby': 'No', 'Student': 'No'}","[{'att': 'Country', 'value': <function parse_treatment.<locals>.<lambda> at 0x00000198B5C9D5E0>, 'val_specified': 'United States'}]",911667.6,1214402.0,302734.9,10156,3450,6706,0.171499,1.0,1.0,6335351.0,-222674.287851
4,"{'Hobby': 'Yes', 'Exercise': '1 - 2 times per week', 'Student': 'No'}","[{'att': 'Country', 'value': <function parse_treatment.<locals>.<lambda> at 0x00000198B5C8B1F0>, 'val_specified': 'United States'}]",678418.8,1640943.0,962524.1,8622,3224,5398,0.145595,1.0,1.0,7528571.0,-490583.491174
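The treatment column is hard to read because each entry embeds the printed representation of a lambda. A minimal sketch, assuming we only care about the `att` and `val_specified` fields, pulls those pairs out of the stored string with a regular expression. `readable_treatment` is a hypothetical helper, not part of the pipeline that wrote the CSV.

```python
import re

# Treatment string copied from row 0 of the table above.
raw = (
    "[{'att': 'Country', 'value': <function parse_treatment.<locals>.<lambda> "
    "at 0x00000198B5E98940>, 'val_specified': 'United States'}, "
    "{'att': 'RaceEthnicity_WhiteorofEuropeandescent', 'value': "
    "<function parse_treatment.<locals>.<lambda> at 0x00000198B5E988B0>, "
    "'val_specified': 1.0}]"
)

# Capture the attribute name and the specified value of each dict-like entry.
PAIR = re.compile(r"'att': '([^']+)'.*?'val_specified': ('[^']*'|[^}]+)\}")

def readable_treatment(text):
    """Return a list of (attribute, value-as-string) pairs."""
    return [(att, val.strip("'")) for att, val in PAIR.findall(text)]

pairs = readable_treatment(raw)
print(" & ".join(f"{att}={val}" for att, val in pairs))
```

Values come back as strings, so a numeric value such as 1.0 would still need converting before it is used to filter the data.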


# MEPS Data

Our goal is to compare young people with old people on whether they exercise (at least 3 times a week) or not.

Value 1 = "yes"

Value 2 = "no"

```
df['group1'] = df['AGE'].apply(lambda x: 1 if x <= 30 else 0)

df['group2'] = df['AGE'].apply(lambda x: 1 if x >= 60 else 0)
```

In [8]:
df = pd.read_csv("outputs/meps/clean_data.csv")
avg_group1 = np.mean(df.loc[df['group1']==1]['Exercise'])
avg_group2 = np.mean(df.loc[df['group2']==1]['Exercise'])
print(f"mean exercise group1 = young is: {avg_group1}")
print(f"mean exercise group2 = old is: {avg_group2}")
print(f"mean diff is: {avg_group1-avg_group2}")

mean exercise group1 = young is: 1.4385964912280702
mean exercise group2 = old is: 1.5389841475250727
mean diff is: -0.10038765629700253


The negative difference means that the young group exercises more than the old group: since 1 = "yes" and 2 = "no", a lower mean means more people answered "yes".
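Because Exercise is coded 1 = "yes" and 2 = "no", a group mean can be turned into the share of people who exercise: if a fraction p answered "yes", the mean is 1·p + 2·(1 − p) = 2 − p, so p = 2 − mean. The sketch below applies this to the two printed means; it assumes the column contains only the values 1 and 2, and `share_yes` is a hypothetical helper.

```python
def share_yes(mean_code):
    """Fraction answering "yes" when the column holds only 1 (yes) and 2 (no)."""
    return 2.0 - mean_code

# Means copied from the printed output of cell In [8].
young = share_yes(1.4385964912280702)
old = share_yes(1.5389841475250727)

print(f"young who exercise: {young:.1%}")
print(f"old who exercise:   {old:.1%}")
```

Read this way, the sign flips: a negative difference in mean codes corresponds to a larger share of exercisers in the young group.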

Now we want to check whether there are subpopulations with the opposite relation (a positive difference).

Let's look at all the subpopulations that have a support of at least 0.1 and are defined by the
attributes = ['Married', 'IsHadStroke', 'DoesDoctorRecommendExercise', 'IsWorking', 'CurrentlySmoke']


We will run the DivExplorer code:

In [9]:
df_step1 = pd.read_csv("outputs/meps/interesting_subpopulations.csv")
df_step1

Unnamed: 0,support,itemset,GROUP_1,GROUP_2,Exercise,Exercise_group1,Exercise_group2,Exercise_div,length,support_count
0,0.312721,"frozenset({'DoesDoctorRecommendExercise=2.0', 'IsHadStroke=2.0', 'IsWorking=1.0'})",1549.0,398.0,1.381613,1.395739,1.326633,0.069106,3,1947.0
1,0.315612,"frozenset({'DoesDoctorRecommendExercise=2.0', 'IsWorking=1.0'})",1552.0,413.0,1.381679,1.394974,1.331719,0.063255,2,1965.0
2,0.319949,"frozenset({'IsHadStroke=2.0', 'CurrentlySmoke=0', 'Married=1.0'})",571.0,1421.0,1.495482,1.509632,1.489796,0.019836,3,1992.0
3,0.339544,"frozenset({'CurrentlySmoke=0', 'Married=1.0'})",574.0,1540.0,1.506149,1.508711,1.505195,0.003516,2,2114.0
4,0.344523,"frozenset({'IsHadStroke=2.0', 'Married=1.0'})",624.0,1521.0,1.499767,1.501603,1.499014,0.002589,2,2145.0
5,0.522647,"frozenset({'DoesDoctorRecommendExercise=2.0', 'IsHadStroke=2.0', 'CurrentlySmoke=0'})",2089.0,1165.0,1.430854,1.427956,1.436052,-0.008096,3,3254.0
6,0.366849,frozenset({'Married=1.0'}),627.0,1657.0,1.510946,1.500797,1.514786,-0.013988,1,2284.0
7,0.585288,"frozenset({'DoesDoctorRecommendExercise=2.0', 'IsHadStroke=2.0'})",2353.0,1291.0,1.429199,1.41904,1.447715,-0.028675,2,3644.0
8,0.542724,"frozenset({'DoesDoctorRecommendExercise=2.0', 'CurrentlySmoke=0'})",2092.0,1287.0,1.440367,1.42782,1.460761,-0.032941,2,3379.0
9,0.407485,"frozenset({'IsHadStroke=2.0', 'CurrentlySmoke=0', 'IsWorking=1.0'})",1801.0,736.0,1.430824,1.420877,1.455163,-0.034286,3,2537.0


We can see that in the first 5 rows there are subpopulations with the opposite relation!

The young group (Exercise_group1 column) has a higher Exercise average than the old group (Exercise_group2 column), meaning they exercise less.

For example, in the first row the young group in the subpopulation frozenset({'DoesDoctorRecommendExercise=2.0', 'IsHadStroke=2.0', 'IsWorking=1.0'}) averaged 1.395739 while the old group averaged 1.326633, a difference of 0.069106!

Now we want to find the reason. Let's run the second step with the possible

attributes = ['Region', 'Race', 'Sex', 'Education', 'IsHadHeartAttack', 'IsDiagnosedAsthma', 'IsBornInUSA', 'LongSinceLastFluVaccination', 'TakesAspirinFrequently', 'WearsSeatBelt', 'HoldHealthInsurance']

In [10]:
df_step2 = pd.read_csv("outputs/meps/subpopulations_and_treatments.csv")
df_step2

Unnamed: 0,itemset,treatment
0,"frozenset({'DoesDoctorRecommendExercise=2.0', 'IsHadStroke=2.0', 'IsWorking=1.0'})",IsBornInUSA_1.0
1,"frozenset({'DoesDoctorRecommendExercise=2.0', 'IsWorking=1.0'})",IsBornInUSA_1.0
2,"frozenset({'CurrentlySmoke=0', 'Married=1.0'})",Race_4
3,"frozenset({'IsHadStroke=2.0', 'Married=1.0'})",Education_5.0
4,"frozenset({'DoesDoctorRecommendExercise=2.0', 'IsHadStroke=2.0', 'CurrentlySmoke=0'})",Race_3+Education_1.0
5,frozenset({'Married=1.0'}),Education_3.0
6,"frozenset({'DoesDoctorRecommendExercise=2.0', 'IsHadStroke=2.0'})",Education_1.0
7,"frozenset({'DoesDoctorRecommendExercise=2.0', 'CurrentlySmoke=0'})",Race_3
8,"frozenset({'IsHadStroke=2.0', 'CurrentlySmoke=0', 'IsWorking=1.0'})",IsBornInUSA_2.0
9,"frozenset({'CurrentlySmoke=0', 'IsWorking=1.0'})",IsBornInUSA_1.0


We can see that different itemsets get different treatments.

For the itemset mentioned above (also the first row here), the explanation is: IsBornInUSA=1

Let's check whether it aligns with the data:

In [11]:
df_subpopulation = df.loc[(df["DoesDoctorRecommendExercise"]==2.0) & (df["IsHadStroke"]==2.0) & (df["IsWorking"]==1.0)]
avg_group1 = np.mean(df_subpopulation.loc[df_subpopulation['group1']==1]['Exercise'])
avg_group2 = np.mean(df_subpopulation.loc[df_subpopulation['group2']==1]['Exercise'])
print(f"mean exercise group1 = young is: {avg_group1}")
print(f"mean exercise group2 = old is: {avg_group2}")
print(f"mean diff is: {avg_group1-avg_group2}")

mean exercise group1 = young is: 1.395739186571982
mean exercise group2 = old is: 1.3266331658291457
mean diff is: 0.06910602074283623


In [12]:
df_treated = df_subpopulation.loc[df_subpopulation["IsBornInUSA"]==1.0]
avg_group1 = np.mean(df_treated.loc[df_treated['group1']==1]['Exercise'])
avg_group2 = np.mean(df_treated.loc[df_treated['group2']==1]['Exercise'])
print(f"mean exercise group1 = young is: {avg_group1}")
print(f"mean exercise group2 = old is: {avg_group2}")
print(f"mean diff is: {avg_group1-avg_group2}")

mean exercise group1 = young is: 1.3676703645007924
mean exercise group2 = old is: 1.288888888888889
mean diff is: 0.0787814756119034


The actual results show that under this treatment both means decreased, the young group's slightly (from 1.396 to 1.368) and the old group's a bit more (from 1.327 to 1.289), so the difference grew from 0.069 to 0.079!

For the final results, let's pick the top K=5 facts from all the data:

In [13]:
df_step3 = pd.read_csv("outputs/meps/find_k/5_0.5_9e-05.csv")
df_step3

Unnamed: 0,itemset,treatment,ate1,ate2,iscore,size_itemset,size_group1,size_group2,support,ni_score,utility,std,diff_means
0,"{'DoesDoctorRecommendExercise': 2.0, 'IsHadStroke': 2.0, 'CurrentlySmoke': 0.0}","[{'att': 'Race', 'value': <function parse_treatment.<locals>.<lambda> at 0x0000026E0CDDB790>, 'val_specified': 3.0}, {'att': 'Education', 'value': <function parse_treatment.<locals>.<lambda> at 0x0000026E0CDDBEE0>, 'val_specified': 1.0}]",0.273353,0.565891,0.292539,3254,2089,1165,0.522647,0.000292,0.000292,0.495196,-0.008096
1,"{'CurrentlySmoke': 0.0, 'Married': 1.0}","[{'att': 'Race', 'value': <function parse_treatment.<locals>.<lambda> at 0x0000026E0CDDB5E0>, 'val_specified': 4.0}]",0.173622,-0.081288,0.25491,2114,574,1540,0.339544,0.000255,0.000255,0.499962,0.003516
2,"{'DoesDoctorRecommendExercise': 2.0, 'Married': 5.0}","[{'att': 'Education', 'value': <function parse_treatment.<locals>.<lambda> at 0x0000026E0CE68670>, 'val_specified': 1.0}]",0.064612,0.201058,0.136446,1929,1825,104,0.30983,0.000136,0.000136,0.491749,-0.07529
3,"{'IsWorking': 4.0, 'IsHadStroke': 2.0, 'CurrentlySmoke': 0.0}","[{'att': 'IsBornInUSA', 'value': <function parse_treatment.<locals>.<lambda> at 0x0000026E0CE689D0>, 'val_specified': 1.0}]",-0.138174,0.058542,0.196716,2507,770,1737,0.402666,0.000197,0.000197,0.499436,-0.054873
4,"{'DoesDoctorRecommendExercise': 2.0, 'IsHadStroke': 2.0, 'IsWorking': 1.0}","[{'att': 'IsBornInUSA', 'value': <function parse_treatment.<locals>.<lambda> at 0x0000026E0CDDBB80>, 'val_specified': 1.0}]",-0.151493,-0.180991,0.029497,1947,1549,398,0.312721,2.9e-05,2.9e-05,0.485782,0.069106
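One pattern worth checking in this table: in every row shown, iscore appears to equal the absolute difference between ate1 and ate2. This is an observation from the printed values, not a documented definition, and the small check below only confirms it up to the rounding of the display.

```python
# (ate1, ate2, iscore) copied from the five rows of the table above.
rows = [
    (0.273353, 0.565891, 0.292539),
    (0.173622, -0.081288, 0.25491),
    (0.064612, 0.201058, 0.136446),
    (-0.138174, 0.058542, 0.196716),
    (-0.151493, -0.180991, 0.029497),
]

TOL = 5e-6  # the table is rounded to six decimals

# Deviation of each printed iscore from |ate1 - ate2|.
deviations = [abs(abs(ate1 - ate2) - iscore) for ate1, ate2, iscore in rows]
matches = [d < TOL for d in deviations]

print(matches)
```

If the pattern holds in general, a large iscore means the treatment affects the two groups very differently, whichever direction each effect goes.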
