<div style="
    padding: 15px;
    margin: 10px 0;
    border: 1px solid #f7be0d;
    border-radius: 4px;
    background-color: #e3db24;
    color: #060606;
    font-size: 16px;
    line-height: 1.5;
    word-wrap: break-word;
    text-align: left;">
    <strong>Inferential Statistics</strong>
<p>This notebook builds on the notebook "EDA-maternal-employment", which can be found in the same repo. That notebook put forward hypotheses about the factors influencing maternal employment and examined them with EDA methods. The hypotheses are restated here and tested further with inferential methods ONLY if the EDA indicated a relationship.</p>
</div>

<h1 style="color: #e3db24;">00 | Libraries and Settings</h1>

In [22]:
# 📚 Basic libraries
import pandas as pd
import glob
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# 🛞 Machine Learning
import xgboost as xgb
import statsmodels.api as sm
import statsmodels.stats.diagnostic as smd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [4]:
# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
pd.set_option('display.float_format', '{:,.2f}'.format)
import warnings
warnings.filterwarnings('ignore') # ignore warnings

<h1 style="color: #e3db24;">01 | Data Extraction</h1>

In [6]:
eda_df = pd.read_excel('data/eda_df.xlsx')
eu_df = pd.read_excel('data/eu_df.xlsx')

In [7]:
eu_df.head()

Unnamed: 0,country,gdp,fertility,unemployment,spending_family_benefits,pay_gap,age_first_child,women_15-49,emp_women,emp_men,emp_women_pt,emp_men_pt,manager_women,edu_women_score,edu_women_cat,emp_mothers_partnered,emp_mothers_single,fathers_full_paid_leave,mothers_full_paid_leave,emp_pt_maternal,emp_ft_maternal,emp_maternal
0,Austria,38560,1.48,5.3,2.7,12.14,29.7,37.2,70.3,77.9,50.1,12.2,35.5,2.1,High,75.83,73.27,9.39,51.2,40.38,35.18,75.56
1,Belgium,37170,1.6,5.3,2.06,1.11,29.2,37.2,63.3,69.9,38.1,10.7,35.4,2.21,High,77.99,64.04,6.65,15.05,19.72,55.63,75.53
2,Bulgaria,7730,1.58,4.4,1.59,8.81,26.4,33.3,67.4,73.9,1.5,1.3,38.3,2.11,High,71.95,67.1,1.93,73.14,0.61,68.52,71.13
3,Croatia,14630,1.58,6.0,1.85,3.21,29.0,35.2,62.1,69.4,4.5,3.0,28.9,2.07,Medium,77.43,81.5,7.87,47.62,3.87,73.13,77.79
4,Cyprus,28790,1.39,5.7,0.93,20.84,30.0,43.5,71.4,79.3,10.9,5.8,21.0,2.28,High,74.37,64.38,1.44,15.84,7.63,65.54,73.18


In [8]:
eda_df.head()

Unnamed: 0,country,emp_maternal,spending_family_benefits,fathers_full_paid_leave,emp_mothers_partnered,emp_mothers_single,edu_women_score,edu_women_cat,manager_women,emp_men_pt,pay_gap,spending_category
0,Austria,75.56,2.7,9.39,75.83,73.27,2.1,High,35.5,12.2,12.14,High
1,Belgium,75.53,2.06,6.65,77.99,64.04,2.21,High,35.4,10.7,1.11,Medium-High
2,Bulgaria,71.13,1.59,1.93,71.95,67.1,2.11,High,38.3,1.3,8.81,Medium-Low
3,Croatia,77.79,1.85,7.87,77.43,81.5,2.07,Medium,28.9,3.0,3.21,Medium-Low
4,Cyprus,73.18,0.93,1.44,74.37,64.38,2.28,High,21.0,5.8,20.84,Low
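
Both previews carry an `Unnamed: 0` column, which is typically the index that was written to the Excel file and read back as an ordinary column. A minimal sketch of one way to drop it, assuming that is indeed where the column comes from; the `raw` frame below is a small stand-in built from the first three preview rows, not the full data set. Passing `index_col=0` to `pd.read_excel` would be another option.

```python
import pandas as pd

# Stand-in for a frame returned by pd.read_excel (first three preview rows only)
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "country": ["Austria", "Belgium", "Bulgaria"],
    "emp_maternal": [75.56, 75.53, 71.13],
})

# Keep every column whose name does not start with "Unnamed"
clean = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

print(clean.columns.tolist())
```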


<h2 style="color: #ec7511;">Copy as Best Practice</h2>

In [10]:
inf_df = eda_df.copy()
regression_all = eu_df.copy()

## <span style="color: #ec7511;">Moving the Target to the Right</span>

In [None]:
#eu_df = eu_df[[col for col in eu_df.columns if col != "emp_women"] + ["emp_women"]]
#df_women = eu_df[[col for col in eu_df.columns if col != "emp_women"] + ["emp_women"]]
#df_mothers = eu_df[[col for col in eu_df.columns if col != "emp_mothers"] + ["emp_mothers"]]

<h1 style="color: #e3db24;">02 | Hypothesis Testing</h1>

<div style="
    padding: 15px;
    margin: 10px 0;
    border: 1px solid #f7be0d;
    border-radius: 4px;
    background-color: #e3db24;
    color: #060606;
    font-size: 16px;
    line-height: 1.5;
    word-wrap: break-word;
    text-align: left;">
    <strong>Overview Hypotheses:</strong>
<p>These are the four hypotheses for which the preceding EDA indicated a relationship between the variables.</p>
<ul>
    <li><strong>H2:</strong> Mothers in partnerships have a higher employment rate than single mothers. This difference is smaller in countries with higher spending on family benefits.</li>
    <li><strong>H3:</strong> Countries with longer fully paid parental leave for fathers have a higher maternal employment rate.</li>
    <li><strong>H4:</strong> The higher the education level of women in a country, the higher the maternal employment rate.</li>
    <li><strong>H6:</strong> In countries where more men work part-time, maternal employment rates are higher.</li>
</ul>
<p>In the following, I will analyse the detected relationships with inferential methods such as t-tests, ANOVA and linear regression. Afterwards, I will build a comprehensive multiple regression model that also includes control variables such as GDP, unemployment and several other parameters.</p>
</div>

## <span style="color: #ec7511;">Hypothesis 2:</span>
<p>Mothers in partnerships have a higher employment rate than single mothers. This difference is smaller in countries with higher spending on family benefits.</p>

### <span style="color: #ec300e;">Independent Two Sample T-Test Between Single and Partnered Mothers</span>

In [17]:
# two sample t-test for the means of two independent samples
group1 = eda_df["emp_mothers_single"]
group2 = eda_df["emp_mothers_partnered"]

t_stat, p_value = stats.ttest_ind(group1, group2)

print(f"T-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.3f}")
print("")

if p_value < 0.05:
    print("\n🚨 Significant difference between the means of the two groups (p < 0.05)")
else:
    print("\n❎ No significant difference between the means of the two groups (p >= 0.05)")

T-statistic: -1.759
p-value: 0.085


❎ No significant difference between the means of the two groups (p >= 0.05)
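
The two columns compared above hold one value per country each, so the observations come in pairs rather than as two unrelated samples. A paired t-test on the per-country gap may therefore be worth checking alongside the independent test. The sketch below is only an illustration of that idea: it uses the five rows printed by `eda_df.head()`, not the full 25 countries, so its result is not a finding.

```python
import pandas as pd
import scipy.stats as stats

# First five rows of eda_df as printed by eda_df.head() (not the full data set)
sample = pd.DataFrame({
    "country": ["Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus"],
    "emp_mothers_partnered": [75.83, 77.99, 71.95, 77.43, 74.37],
    "emp_mothers_single": [73.27, 64.04, 67.10, 81.50, 64.38],
})

# Per-country gap: single minus partnered (negative = single mothers are employed less)
gap = sample["emp_mothers_single"] - sample["emp_mothers_partnered"]

# Paired t-test: does the mean per-country gap differ from zero?
paired = stats.ttest_rel(sample["emp_mothers_single"], sample["emp_mothers_partnered"])

print(f"Mean gap (single - partnered): {gap.mean():.2f} percentage points")
print(f"Paired t-statistic: {paired.statistic:.3f}, p-value: {paired.pvalue:.3f}")
```

On the full data the same call would be `stats.ttest_rel(eda_df["emp_mothers_single"], eda_df["emp_mothers_partnered"])`.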


### <span style="color: #ec300e;">One Way ANOVA on the Impact of Public Spending on the Employment Rate of Single/Partnered Mothers</span>

In [18]:
# ANOVA for single mothers
anova_single = stats.f_oneway(
    eda_df[eda_df["spending_category"] == "Low"]["emp_mothers_single"],
    eda_df[eda_df["spending_category"] == "Medium-Low"]["emp_mothers_single"],
    eda_df[eda_df["spending_category"] == "Medium-High"]["emp_mothers_single"],
    eda_df[eda_df["spending_category"] == "High"]["emp_mothers_single"]
)

# ANOVA for partnered mothers
anova_partnered = stats.f_oneway(
    eda_df[eda_df["spending_category"] == "Low"]["emp_mothers_partnered"],
    eda_df[eda_df["spending_category"] == "Medium-Low"]["emp_mothers_partnered"],
    eda_df[eda_df["spending_category"] == "Medium-High"]["emp_mothers_partnered"],
    eda_df[eda_df["spending_category"] == "High"]["emp_mothers_partnered"]
)

# Print results
print("ANOVA for Single Mothers")
print(f"F-value: {anova_single.statistic:.3f}, p-value: {anova_single.pvalue:.3f}")
print("\nANOVA for Partnered Mothers")
print(f"F-value: {anova_partnered.statistic:.3f}, p-value: {anova_partnered.pvalue:.3f}")

# Interpretation
alpha = 0.05
if anova_single.pvalue < alpha:
    print("\n🚨 There is a significant difference in the employment rate of single mothers between the categories of spending on family benefits.")
else:
    print("\n❎ There is NO significant difference in the employment rate of single mothers between the categories of spending on family benefits.")

if anova_partnered.pvalue < alpha:
    print("\n🚨 There is a significant difference in the employment rate of partnered mothers between the categories of spending on family benefits.")
else:
    print("\n❎ There is NO significant difference in the employment rate of partnered mothers between the categories of spending on family benefits.")


ANOVA for Single Mothers
F-value: 1.699, p-value: 0.198

ANOVA for Partnered Mothers
F-value: 0.123, p-value: 0.946

❎ There is NO significant difference in the employment rate of single mothers between the categories of spending on family benefits.

❎ There is NO significant difference in the employment rate of partnered mothers between the categories of spending on family benefits.
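
The second half of H2 is about the difference between partnered and single mothers, which the two separate ANOVAs do not test directly. One possible way to look at it is to compute the gap per country and relate it to `spending_family_benefits`. The sketch below does this with a Spearman rank correlation. The column `emp_gap` is a name introduced here for illustration, and the data are again only the five rows shown by `eda_df.head()`.

```python
import pandas as pd
import scipy.stats as stats

# First five rows of eda_df as printed by eda_df.head() (not the full data set)
sample = pd.DataFrame({
    "country": ["Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus"],
    "spending_family_benefits": [2.70, 2.06, 1.59, 1.85, 0.93],
    "emp_mothers_partnered": [75.83, 77.99, 71.95, 77.43, 74.37],
    "emp_mothers_single": [73.27, 64.04, 67.10, 81.50, 64.38],
})

# Hypothetical helper column: employment gap in percentage points
sample["emp_gap"] = sample["emp_mothers_partnered"] - sample["emp_mothers_single"]

# H2 expects a negative association: higher spending, smaller gap
rho, p_value = stats.spearmanr(sample["spending_family_benefits"], sample["emp_gap"])

print(f"Spearman rho: {rho:.3f}, p-value: {p_value:.3f}")
```

A rank correlation is used here because it makes no normality assumption, which seems a cautious choice with so few countries. An ANOVA of `emp_gap` across `spending_category` would be the closer analogue to the cells above.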


<div style="
    padding: 15px;
    margin: 10px 0;
    border: 1px solid #2d0df7;
    border-radius: 4px;
    background-color: #0dd4f7;
    color: #060606;
    font-size: 16px;
    line-height: 1.5;
    word-wrap: break-word;
    text-align: left;">
    <strong>Conclusions: ...</strong>
<p>TEXT</p>
</div>

## <span style="color: #ec7511;">Hypothesis 3:</span>
<p>Countries with longer fully paid parental leave for fathers have a higher maternal employment rate.</p>

### <span style="color: #ec300e;">OLS Regression of Paternal Leave on Maternal Employment</span>

In [26]:
# independent variable (predictor)
X = eda_df["fathers_full_paid_leave"]
X = sm.add_constant(X)  # add constant term (intercept, the predicted y value when x=0)

# dependent variable (target)
y = eda_df["emp_maternal"]

# OLS regression
model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           emp_maternal   R-squared:                       0.085
Model:                            OLS   Adj. R-squared:                  0.045
Method:                 Least Squares   F-statistic:                     2.144
Date:                Sat, 08 Feb 2025   Prob (F-statistic):              0.157
Time:                        15:28:57   Log-Likelihood:                -83.424
No. Observations:                  25   AIC:                             170.8
Df Residuals:                      23   BIC:                             173.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                     

<div style="
    padding: 15px;
    margin: 10px 0;
    border: 1px solid #2d0df7;
    border-radius: 4px;
    background-color: #0dd4f7;
    color: #060606;
    font-size: 16px;
    line-height: 1.5;
    word-wrap: break-word;
    text-align: left;">
    <strong>Conclusions: ...</strong>
<p>TEXT</p>
    <ul>
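        <li>Not significant: the p-value of the F-statistic is 0.157 (above 0.05), and fully paid paternal leave explains only about 8.5% of the variance in emp_maternal (R² = 0.085).</li>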
        <li>TEXT</li>
        <li>TEXT</li>
        <li>TEXT</li>
    </ul>
</div>

## <span style="color: #ec7511;">Hypothesis 4:</span>
<p>The higher the education level of women in a country, the higher the maternal employment rate.</p>

### <span style="color: #ec300e;">One Way ANOVA on the Effect of Different Education Level Categories on Maternal Employment</span>

In [21]:
# ANOVA for influence of edu_women_cat on emp_maternal
anova_edu = stats.f_oneway(
    eda_df[eda_df["edu_women_cat"] == "Low"]["emp_maternal"],
    eda_df[eda_df["edu_women_cat"] == "Medium"]["emp_maternal"],
    eda_df[eda_df["edu_women_cat"] == "High"]["emp_maternal"]
)

print("ANOVA for the influence of edu_women_cat on emp_maternal")
print(f"F-value: {anova_edu.statistic:.3f}, p-value: {anova_edu.pvalue:.3f}")

# Interpretation
alpha = 0.05
if anova_edu.pvalue < alpha:
    print("\n🚨 There is a significant difference in the employment rate of mothers between the different categories of education levels.")
else:
    print("\n❎  There is NO significant difference in the employment rate of mothers between the different categories of education levels.")

ANOVA for the influence of edu_women_cat on emp_maternal
F-value: 2.776, p-value: 0.084

❎  There is NO significant difference in the employment rate of mothers between the different categories of education levels.
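
The ANOVA cells in this notebook repeat one boolean filter per category. A small helper that reads the categories from the data could shorten this. `anova_by_category` is a hypothetical name introduced here, and the `toy` frame contains made-up numbers purely to show the call; it is not taken from `eda_df`.

```python
import pandas as pd
import scipy.stats as stats


def anova_by_category(df, group_col, value_col):
    """One-way ANOVA of value_col across the levels of group_col.

    Hypothetical helper: the levels are read from the data and missing values are dropped.
    """
    groups = [g[value_col].dropna() for _, g in df.groupby(group_col)]
    return stats.f_oneway(*groups)


# Toy frame with made-up numbers, only to demonstrate the call (NOT real data)
toy = pd.DataFrame({
    "edu_women_cat": ["Low", "Low", "Medium", "Medium", "High", "High"],
    "emp_maternal": [60.0, 62.0, 70.0, 72.0, 80.0, 82.0],
})

result = anova_by_category(toy, "edu_women_cat", "emp_maternal")
print(f"F-value: {result.statistic:.3f}, p-value: {result.pvalue:.3f}")
```

With the notebook's data the call would be `anova_by_category(eda_df, "edu_women_cat", "emp_maternal")`.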


### <span style="color: #ec300e;">OLS Regression on the Effect of Different Education Level Scores on Maternal Employment</span>

In [20]:
# independent variable (predictor)
X = eda_df["edu_women_score"]
X = sm.add_constant(X)  # add constant term (intercept, the predicted y value when x=0)

# dependent variable (target)
y = eda_df["emp_maternal"]

# OLS regression
model = sm.OLS(y, X).fit()

print(model.summary())


                            OLS Regression Results                            
Dep. Variable:           emp_maternal   R-squared:                       0.217
Model:                            OLS   Adj. R-squared:                  0.183
Method:                 Least Squares   F-statistic:                     6.361
Date:                Sat, 08 Feb 2025   Prob (F-statistic):             0.0190
Time:                        14:54:04   Log-Likelihood:                -81.486
No. Observations:                  25   AIC:                             167.0
Df Residuals:                      23   BIC:                             169.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              24.7332     19.777     

<div style="
    padding: 15px;
    margin: 10px 0;
    border: 1px solid #2d0df7;
    border-radius: 4px;
    background-color: #0dd4f7;
    color: #060606;
    font-size: 16px;
    line-height: 1.5;
    word-wrap: break-word;
    text-align: left;">
    <strong>Conclusions: ...</strong>
<p>A significant relationship was found.</p>
    <ul>
        <li>OLS interpretation: the education level score explains 21.7% of the variance in emp_maternal (R² = 0.217, adjusted R² = 0.183). The model is significant: the p-value of the F-statistic is 0.019, which is below 0.05. The coefficient of 23.19 means that an increase of 1 point in the education score is associated with a maternal employment rate that is 23.2 percentage points higher.</li>
        <li>TEXT</li>
        <li>TEXT</li>
        <li>TEXT</li>
    </ul>
</div>
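
To make the size of the coefficient easier to read: the education scores in the preview rows lie between roughly 2.07 and 2.28, so a full 1-point step is larger than the spread visible there. The sketch below plugs two nearby scores into the fitted line, taking the intercept (24.7332) from the OLS summary and the slope (23.19) from the conclusion above. The function name `predicted_emp_maternal` and the two example scores are chosen here for illustration only.

```python
INTERCEPT = 24.7332  # "const" from the OLS summary
SLOPE = 23.19        # coefficient of edu_women_score quoted in the conclusion


def predicted_emp_maternal(edu_score):
    """Predicted maternal employment rate (%) for a given education score."""
    return INTERCEPT + SLOPE * edu_score


# Two example scores inside the range seen in the preview rows
low_score, high_score = 2.1, 2.2
low_pred = predicted_emp_maternal(low_score)
high_pred = predicted_emp_maternal(high_score)

print(f"Score {low_score}: {low_pred:.2f} %")
print(f"Score {high_score}: {high_pred:.2f} %")
print(f"Difference for +0.1 points: {high_pred - low_pred:.2f} percentage points")
```

A step of 0.1 points in the score thus corresponds to about 2.3 percentage points in the predicted employment rate.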

## <span style="color: #ec7511;">Hypothesis 6:</span>
<p>In countries where more men work part-time, maternal employment rates are higher.</p>

### <span style="color: #ec300e;">OLS Regression on the Effect of Male Part-Time Employment Rate on Maternal Employment</span>

In [27]:
# independent variable (predictor)
X = eda_df["emp_men_pt"]
X = sm.add_constant(X)  # add constant term (intercept, the predicted y value when x=0)

# dependent variable (target)
y = eda_df["emp_maternal"]

# OLS regression
model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           emp_maternal   R-squared:                       0.099
Model:                            OLS   Adj. R-squared:                  0.060
Method:                 Least Squares   F-statistic:                     2.537
Date:                Sat, 08 Feb 2025   Prob (F-statistic):              0.125
Time:                        15:31:04   Log-Likelihood:                -83.231
No. Observations:                  25   AIC:                             170.5
Df Residuals:                      23   BIC:                             172.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         70.9265      2.650     26.761      0.0
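
With a single predictor, the regression's overall test carries the same information as a Pearson correlation: R² equals the squared correlation, and the p-value of the slope equals the p-value of the correlation. The sketch below shows this equivalence with `scipy.stats.linregress` and `scipy.stats.pearsonr`, using only the five rows shown by `eda_df.head()`; it is a cross-check, not a replacement for the model above.

```python
import scipy.stats as stats

# First five rows of eda_df as printed by eda_df.head() (not the full data set)
emp_men_pt = [12.2, 10.7, 1.3, 3.0, 5.8]
emp_maternal = [75.56, 75.53, 71.13, 77.79, 73.18]

# Simple linear regression of maternal employment on male part-time employment
reg = stats.linregress(emp_men_pt, emp_maternal)

# Pearson correlation between the same two variables
r, p_r = stats.pearsonr(emp_men_pt, emp_maternal)

print(f"Regression: R-squared = {reg.rvalue ** 2:.3f}, slope p-value = {reg.pvalue:.3f}")
print(f"Correlation: r squared = {r ** 2:.3f}, p-value = {p_r:.3f}")
```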

<div style="
    padding: 15px;
    margin: 10px 0;
    border: 1px solid #2d0df7;
    border-radius: 4px;
    background-color: #0dd4f7;
    color: #060606;
    font-size: 16px;
    line-height: 1.5;
    word-wrap: break-word;
    text-align: left;">
    <strong>Conclusions: ...</strong>
<p>TEXT</p>
    <ul>
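        <li>Not significant: the p-value of the F-statistic is 0.125 (above 0.05), and the male part-time employment rate explains only about 9.9% of the variance in emp_maternal (R² = 0.099).</li>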
        <li>TEXT</li>
        <li>TEXT</li>
        <li>TEXT</li>
    </ul>
</div>

<h1 style="color: #e3db24;">03 | Multiple Regression</h1>

<h1 style="color: #e3db24;">04 | Data Processing</h1>

<h2 style="color: #ec7511;">X-Y Split</h2>

<h2 style="color: #ec7511;">Normalizing the Data</h2>

<h1 style="color: #e3db24;">05 | Modeling</h1>

<h2 style="color: #ec7511;">Train-Test Split</h2>

<h2 style="color: #ec7511;">Model Validation</h2>

<h1 style="color: #e3db24;">06 | Improving the Model</h1>

<h1 style="color: #e3db24;">07 | Reporting</h1>