# Practice 1. Sampling bias. Weighting sample. Resampling. Bootstrapping.


In [None]:
import pandas as pd
#!pip install pyreadstat

## Data Preparation

We will work with the TIMSS 2019 International Database, specifically the questionnaire about mathematics for 8th graders.
https://timss2019.org/international-database/

Codebook: https://timss2019.org/international-database/downloads/T19_UG_Supp1-international-context-questionnaires.pdf (from page 183)



+ BSMMAT01 - maths score (Mathematics Achievement, points scored by the student for tasks in mathematics)

Variables from the student questionnaire:

Some control variables:
+ BSBG01 - sex (boy/girl)
+ BSDGEDUP - highest education level of the parents

Questions about maths and maths classes:
+ BSBM16A-I - interest in maths (from loving maths to not interested at all)
+ BSBM17A-G - teaching assessment (from "the teacher's explanations are clear, interesting things are learned, and the teacher helps" to "not clear, no help, not interesting")
+ BSBM19A-I - success in maths (a, d, f, g run from success to no success; b, c, e, h, i vice versa)
+ BSBM20A-I - attitude to maths and its importance for life (from important to not important at all)


In [None]:
# reading the data

data = pd.read_spss("timss_data.sav")
data.head()

In [None]:
# selecting columns we need

import string

BSBM16_vars = ['BSBM16' + x for x in string.ascii_uppercase[:9]]
BSBM17_vars = ['BSBM17' + x for x in string.ascii_uppercase[:7]]
BSBM19_vars = ['BSBM19' + x for x in string.ascii_uppercase[:9]]
BSBM20_vars = ['BSBM20' + x for x in string.ascii_uppercase[:9]]

cols = ['BSMMAT01', 'BSBG01', 'BSDGEDUP'] + BSBM16_vars + BSBM17_vars + BSBM19_vars + BSBM20_vars

data = data[cols]

In [None]:
data.head()

**recoding variables**

Let's recode them so that every variable runs from no interest / no understanding / no success (1) to interest / understanding / success (4).


In [None]:
data[BSBM16_vars] = data[BSBM16_vars].replace({'Agree a lot': 4, 'Agree a little': 3, 'Disagree a little': 2, 'Disagree a lot': 1})
data[BSBM17_vars] = data[BSBM17_vars].replace({'Agree a lot': 4, 'Agree a little': 3, 'Disagree a little': 2, 'Disagree a lot': 1})
data[BSBM20_vars] = data[BSBM20_vars].replace({'Agree a lot': 4, 'Agree a little': 3, 'Disagree a little': 2, 'Disagree a lot': 1})

In [None]:
# BSBM19A-I - success in maths (a, d, f, g from success to not success, b, c, e, h, i vice versa)
BSBM19_vars

data[["BSBM19A", "BSBM19D", "BSBM19F", "BSBM19G"]] = data[["BSBM19A", "BSBM19D", "BSBM19F", "BSBM19G"]].replace(
    {'Agree a lot': 4, 'Agree a little': 3, 'Disagree a little': 2, 'Disagree a lot': 1})
data[["BSBM19B", "BSBM19C", "BSBM19E", "BSBM19H", "BSBM19I"]] = data[
    ["BSBM19B", "BSBM19C", "BSBM19E", "BSBM19H", "BSBM19I"]].replace(
    {'Agree a lot': 1, 'Agree a little': 2, 'Disagree a little': 3, 'Disagree a lot': 4})

In [None]:
# converting to numeric
data[BSBM16_vars + BSBM17_vars + BSBM19_vars + BSBM20_vars] = data[
    BSBM16_vars + BSBM17_vars + BSBM19_vars + BSBM20_vars].astype(float)

In [None]:
data.head()

In [None]:
# omit all NAs for simplicity
data = data.dropna()

In [None]:
# now let's calculate average value for each category of variables

data['interest'] = data[BSBM16_vars].mean(axis=1)
data['teaching'] = data[BSBM17_vars].mean(axis=1).copy()
data['success'] = data[BSBM19_vars].mean(axis=1).copy()
data['importance'] = data[BSBM20_vars].mean(axis=1).copy()

In [None]:
# prepare target and control variables

data = data.rename(columns={"BSBG01": "sex", "BSDGEDUP": "education", "BSMMAT01": "math_score"})
data['math_score'] = data['math_score'].astype(float)

data['sex'] = data['sex'].replace({"Boy": 0, "Girl": 1}).astype(int)

In [None]:
data['education'].value_counts()

In [None]:
# let's keep 4 categories - secondary (or lower), post-secondary, university and don't know

data['education'] = data['education'].replace({'Some Primary, Lower Secondary or No School' : 'Secondary or lower',
                                               'Lower Secondary' : 'Secondary or lower', 
                                               'Upper Secondary' : 'Secondary or lower',
                                               'Post-secondary but not University' : 'Post-Secondary'})

In [None]:
data['education'].value_counts()

## Biasing the data

In [None]:
data = data.sample(frac=1., random_state=10)

In [None]:
data['sex'].value_counts()

In [None]:
# let's randomly drop 1000 rows with girls (sex == 1) to introduce a bias by sex into our data
import numpy as np

np.random.seed(110)
drop_indices = np.random.choice(data[data['sex'] == 1].index, 1000, replace=False)
data_sex_biased = data.drop(drop_indices)

In [None]:
data_sex_biased['sex'].value_counts()

## Testing hypotheses


1. There is no difference in math score between girls and boys.

In [None]:
from scipy import stats

In [None]:
# Welch t-test (unknown variances, and we know nothing about their equality)
res_biased = stats.ttest_ind(data_sex_biased[data_sex_biased['sex'] == 0]['math_score'], 
                             data_sex_biased[data_sex_biased['sex'] == 1]['math_score'], 
                             equal_var=False)
print(res_biased)

In [None]:
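# select the numeric column before averaging: mean() over the string column 'education' fails in recent pandas
data_sex_biased.groupby(by=['sex'])['math_score'].mean()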

In [None]:
res_no_bias = stats.ttest_ind(data[data['sex'] == 0]['math_score'], 
                              data[data['sex'] == 1]['math_score'], 
                              equal_var=False)
print(res_no_bias)

In [None]:
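# select the numeric column before averaging: mean() over the string column 'education' fails in recent pandas
data.groupby(by=['sex'])['math_score'].mean()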

## Weighting

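A kind of post-stratification weighting: each group gets the weight (target share) / (share in the sample), with a target share of 0.5 for both boys and girls.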
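As a minimal sketch of this weighting on made-up data (the toy frame, the 70/30 split, and the names `toy`, `target_share` and `weighted_share` are illustrative assumptions, not part of the TIMSS data):

```python
import pandas as pd

# Hypothetical toy sample: 70 boys (0) and 30 girls (1); not the TIMSS data.
toy = pd.DataFrame({"sex": [0] * 70 + [1] * 30})

# Assumed target shares in the population (50/50).
target_share = {0: 0.5, 1: 0.5}

# Observed share of each group in the sample.
sample_share = toy["sex"].value_counts(normalize=True)

# Post-stratification weight = target share / sample share.
toy["weight"] = toy["sex"].map(lambda g: target_share[g] / sample_share.loc[g])

# After weighting, each group contributes its target share of the total weight.
weighted_share = toy.groupby("sex")["weight"].sum() / toy["weight"].sum()
print(weighted_share)
```

The underrepresented group gets a weight above 1 and the overrepresented group a weight below 1, so the weighted group shares match the assumed targets.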

In [None]:
boys_weight = 0.5/((len(data_sex_biased[data_sex_biased['sex'] == 0])/len(data_sex_biased)))
girls_weight = 0.5/((len(data_sex_biased[data_sex_biased['sex'] == 1])/len(data_sex_biased)))

In [None]:
from statsmodels.stats import weightstats

weightstats.ttest_ind(data_sex_biased[data_sex_biased['sex'] == 0]['math_score'], 
                             data_sex_biased[data_sex_biased['sex'] == 1]['math_score'], 
                             usevar='unequal', weights=([boys_weight]*(len(data_sex_biased[data_sex_biased['sex'] == 0])), [girls_weight]*(len(data_sex_biased[data_sex_biased['sex'] == 1]))))

In [None]:
data_sex_biased['weights'] = data_sex_biased['sex'].apply(lambda x: boys_weight if x == 0 else girls_weight)

In [None]:
# or we can just sample with weights and then perform tests

data_weighted = data_sex_biased.sample(n=len(data_sex_biased), replace=True, weights="weights")

In [None]:
stats.ttest_ind(data_weighted[data_weighted['sex'] == 0]['math_score'], 
                              data_weighted[data_weighted['sex'] == 1]['math_score'], 
                              equal_var=False)

Other functions in the `weightstats` module: https://www.statsmodels.org/dev/_modules/statsmodels/stats/weightstats.html

## Bootstrapping

Bootstrapping is a resampling technique that involves **repeatedly drawing samples** from our source data **with replacement**, often to estimate a population parameter.

Algorithm:

1. Draw a sample of size N from the original dataset with replacement. This is a bootstrap sample.
2. Repeat step 1 S times, so that we have S bootstrap samples.
3. Estimate our value on each of the bootstrap samples, so that we have S estimates
4. Use the distribution of estimates for inference (for example, estimating the confidence intervals).
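The four steps can be sketched with plain NumPy. This is only an illustration on simulated data: the normal sample, the choice of the mean as the statistic, and the values of `S` and `N` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 simulated scores standing in for an observed sample.
sample = rng.normal(loc=500, scale=100, size=200)

S = 2000          # number of bootstrap samples
N = len(sample)   # size of each bootstrap sample

# Steps 1-3: resample with replacement S times and compute the mean each time.
estimates = np.array([
    rng.choice(sample, size=N, replace=True).mean()
    for _ in range(S)
])

# Step 4: percentile confidence interval from the distribution of estimates.
ci_low, ci_high = np.percentile(estimates, [2.5, 97.5])
print(ci_low, ci_high)
```

The percentile interval is the simplest choice for step 4; other interval types (for example BCa) use the same bootstrap distribution but correct for bias and skewness.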

With the *stats.bootstrap* function we can build confidence intervals for a statistic.

In [None]:
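# boys and girls are passed as two separate samples; subtracting the two Series directly
# would align them on their (disjoint) row indices and produce only NaN
def diff_of_means(x, y, axis=0):
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)
res = stats.bootstrap((data[data['sex'] == 0]['math_score'], data[data['sex'] == 1]['math_score']), diff_of_means,
                      vectorized=True, method='percentile', confidence_level=0.95, random_state=10)
print(res.confidence_interval)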

Bootstrapping for testing the mean difference

In [None]:
n_resamples = 3000
sample_size = len(data)

diff_distr = []

for i in range(n_resamples):
    # resample rows with replacement, then split the resample (not the original data) by sex
    sample = data.iloc[np.random.choice(len(data), sample_size)]
    mean_diff = sample.loc[sample['sex'] == 0, 'math_score'].mean() - sample.loc[sample['sex'] == 1, 'math_score'].mean()
    diff_distr.append(mean_diff)

In [None]:
pd.Series(diff_distr).hist()

In [None]:
# confidence interval
left = np.percentile(diff_distr, 0.05/2*100)
right = np.percentile(diff_distr, 100-0.05/2*100)
(left, right)

## Task for you (Deadline: 13.09.2022 09:00)
Send it by email to aspestova@hse.ru with the subject "HW[number] [Your name]".

1. Come up with a hypothesis about the relation between parents' highest education and math achievement.
2. Test this hypothesis using a suitable statistical test on the biased and the non-biased data. Are there any differences in the results? Describe them.
3. Reweight the biased sample and run the test again. What has changed? (Treat the proportions of the education categories in the data as the proportions in the general population.)
4. Using the bootstrap, estimate confidence intervals of the target variable for each education category you analyze (on the original and the biased sample).
5. Using the jackknife algorithm, estimate the same confidence intervals (on the original and the biased sample).

! You can construct more complicated hypotheses and perform other tests, feel free to use other variables in the data.


**Reminder: Jackknife algorithm**

The name refers to cutting the data

Steps:
+ Remove a single observation, 
+ Calculate the statistic without that one value, 
+ Repeat that process for each observation (remove just one value, calculate the statistic). 
+ The estimate of a parameter computed from this smaller sample is called a partial estimate. A pseudo-value is then computed as the (weighted) difference between the whole-sample estimate and the partial estimate.
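A minimal sketch of these steps for the sample mean, on simulated data. The pseudo-value here uses the common weighted form n * (whole-sample estimate) - (n - 1) * (partial estimate), and the interval relies on a normal approximation; both choices, like the simulated sample `x`, are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=500, scale=100, size=50)  # hypothetical sample
n = len(x)

# Whole-sample estimate of the statistic (here: the mean).
theta_full = x.mean()

# Partial estimates: the statistic with the i-th observation removed.
partial = np.array([np.delete(x, i).mean() for i in range(n)])

# Pseudo-values: n * full estimate - (n - 1) * partial estimate.
pseudo = n * theta_full - (n - 1) * partial

# Jackknife estimate and standard error from the pseudo-values.
theta_jack = pseudo.mean()
se_jack = pseudo.std(ddof=1) / np.sqrt(n)

# Approximate 95% interval using the normal quantile 1.96.
ci = (theta_jack - 1.96 * se_jack, theta_jack + 1.96 * se_jack)
print(theta_jack, se_jack, ci)
```

For the mean, each pseudo-value equals the removed observation, so the jackknife standard error coincides with the usual standard error of the mean; for other statistics the pseudo-values differ from the raw data.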

In [None]:
data_task = data['education']


np.random.seed(110)
drop_indices1 = np.random.choice(data[data['education'] == 'Post-Secondary'].index, 300, replace=False)
drop_indices2 = np.random.choice(data[data['education'] == 'Secondary or lower'].index, 100, replace=False)
# drop both sets of rows; the second drop must start from the already reduced frame
data_ed_biased = data.drop(drop_indices1)
data_ed_biased = data_ed_biased.drop(drop_indices2)

In [None]:
# biased data
data_ed_biased.head()