In [177]:
import pandas as pd
import numpy as np
import miceforest as mf
from math import nan
# Note: properly installing lightgbm allows you to run miceforest. If you have an M1 mac, please see:
# https://towardsdatascience.com/install-xgboost-and-lightgbm-on-apple-m1-macs-cb75180a2dda

In [178]:
student_survey = pd.read_csv('data/36423-0002-Data.tsv',sep='\t')

### Math Enrollment

S2MSPR12: S2 D13 Teenager taking math class(es) in spring 2012
[Are you currently/Were you] taking a math course [during the spring term of 2012?]
1=Yes
0=No

S1MFALL09: S1 C03 9th grader is taking a math course in the fall 2009 term

In [179]:
student_survey = student_survey[(student_survey['S2MSPR12'] == 1) & (student_survey['S1MFALL09'] == 1)]
len(student_survey)

14575

### Sample size rounded to the nearest ten in accordance with NCES regulation
The authors apply this rounding throughout their paper.

### Low teacher support 

S2MTCHTREAT: S2 D18A Teen's spring 2012 math teacher treats some kids better than others
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.
Your teacher...
[treats/treated] some kids better than other kids.
1=Strongly agree
2=Agree
3=Disagree
4=Strongly disagree

S2MTCHINTRST: S2 D18B Teen's spring 2012 math teacher makes math interesting
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.
Your teacher...
[makes/made] math interesting.
1=Strongly agree
2=Agree
3=Disagree
4=Strongly disagree


S2MTCHEASY: S2 D18C Teen's spring 2012 math teacher makes math easy to understand
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.
Your teacher...
[makes/made] math easy to understand.
1=Strongly agree
2=Agree
3=Disagree
4=Strongly disagree

S2MTCHTHINK: S2 D18D Teen's spring 2012 math teacher wants students to think, not
memorize
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.
Your teacher...
[wants/wanted] students to think, not just memorize things.
1=Strongly agree
2=Agree
3=Disagree
4=Strongly disagree

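S2MTCHGIVEUP: S2 D18E Teen's spring 2012 math teacher doesn't let students give up
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.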
Your teacher...
[doesn't/didn't] let people give up when the work [gets/got] hard.
1=Strongly agree
2=Agree
3=Disagree
4=Strongly disagree

### Ability self-concepts

S1MTESTS: S1 C08A 9th grader confident can do excellent job on fall 2009 math tests
How much do you agree or disagree with the following statements about your [fall 2009 math] course?
You are confident that you can do an excellent job on tests in this course
Strongly agree
Agree
Disagree
Strongly disagree

S1MTEXTBOOK: S1 C08B 9th grader certain can understand fall 2009 math textbook
How much do you agree or disagree with the following statements about your [fall 2009 math] course?
You are certain that you can understand the most difficult material presented in the textbook used
in this course
Strongly agree
Agree
Disagree
Strongly disagree

S1MSKILLS: S1 C08C 9th grader certain can master skills in fall 2009 math course
How much do you agree or disagree with the following statements about your [fall 2009 math] course?
You are certain that you can master the skills being taught in this course
Strongly agree
Agree
Disagree
Strongly disagree

S1MASSEXCL: S1 C08D 9th grader confident can do excellent job on fall 2009 math
assignments
How much do you agree or disagree with the following statements about your [fall 2009 math] course?
You are confident that you can do an excellent job on assignments in this course
Strongly agree
Agree
Disagree
Strongly disagree

### Parental support

P1MUSEUM: P1 E07A Went to science or engineering museum with 9th grader in last year
P2MUSEUM: P2 B10A Visited science-related destination together in last year
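| Value | Label | Unweighted frequency | % |
|---|---|---|---|
| 0 | No | 7195 | 30.6 % |
| 1 | Yes | 8253 | 35.1 % |
| -9 | Missing (missing data) | 1340 | 5.7 % |
| -8 | Unit non-response (missing data) | 6715 | 28.6 % |

| Value | Label | Unweighted frequency | % |
|---|---|---|---|
| 0 | No | 4837 | 20.6 % |
| 1 | Yes | 3248 | 13.8 % |
| -9 | Missing (missing data) | 63 | 0.3 % |
| -8 | Unit non-response (missing data) | 2603 | 11.1 % |
| -6 | Component not applicable (missing data) | 12279 | 52.2 % |
| -4 | Item not administered: abbreviated interview (missing data) | 473 | 2.0 % |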
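
The negative values in the frequency tables above are reserve codes rather than real answers. As a minimal sketch on made-up toy responses (not HSLS data), the snippet below tallies reserve codes by reason and then blanks them before computing a share of "Yes" answers. `RESERVE_CODES` and `split_reserve_codes` are illustrative names introduced here; the labels are copied from the tables above.

```python
import pandas as pd

# Reserve-code labels as listed in the codebook excerpts above.
RESERVE_CODES = {
    -9: "Missing",
    -8: "Unit non-response",
    -6: "Component not applicable",
    -4: "Item not administered: abbreviated interview",
}

def split_reserve_codes(series):
    """Return (series with reserve codes set to NaN, counts per reason)."""
    is_reserve = series.isin(list(RESERVE_CODES))
    reasons = series[is_reserve].map(RESERVE_CODES).value_counts()
    cleaned = series.astype(float).mask(is_reserve)
    return cleaned, reasons

# Toy 0/1 item with a few reserve codes mixed in (hypothetical values).
toy = pd.Series([1, 0, -9, 1, -8, -8, 0, -6])
cleaned, reasons = split_reserve_codes(toy)
print(reasons)
print(cleaned.mean())  # share of "Yes" among valid answers only
```

Keeping the reason counts separate from the cleaned values makes it easy to report how much of the missingness is item-level versus unit-level before imputing.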

P1COMPUTER: P1 E07B Worked or played on computer with 9th grader in last year
P2COMPUTER: P2 B10B Worked or played on computer with teenager in last year

P1FIXED: P1 E07C Built or fixed something with 9th grader in last year
P2FIXED: P2 B10C Built or fixed something with teenager in last year

P1LIBRARY: P1 E07G Visited a library with 9th grader in last year
P2LIBRARY: P2 B10F Visited a library with teenager in last year

P1STEMDISC: P1 E07F Discussed STEM program or article with 9th grader in last year
P2STEMDISC: P2 B10E Discussed STEM program or article with teenager in last year

### Math Achievement Score

X2TXMSCR: X2 Mathematics IRT-estimated number right score (of ## first follow-up items)
Based upon 20,594 valid cases out of 23,503 total cases.
• Mean: 67.2219
• Minimum: 25.0057
• Maximum: 115.1000
• Standard Deviation: 19.2183

X2X1TXMSCR: X2 Mathematics IRT-estimated number right score at time of base year (of 118
first follow-up items)

X3THIMATH9: X3 Highest level mathematics course taken - ninth grade
13 AP/IB Calculus 0 0.0 %
Missing Data
-9 Missing 770 3.3 %
-8 Unit non-response

In [180]:
all_columns = [
    'S2MSPR12',
    'S1MFALL09',
    'S2MTCHTREAT', 
    'S2MTCHINTRST',
    'S2MTCHEASY',
    'S2MTCHTHINK',
    'S2MTCHGIVEUP',
    'S1MTESTS',
    'S1MTEXTBOOK',
    'S1MSKILLS',
    'S1MASSEXCL',
    'P1MUSEUM',
    'P2MUSEUM',
    'P1COMPUTER',
    'P2COMPUTER',
    'P1FIXED',
    'P2FIXED',
    'P1LIBRARY',
    'P2LIBRARY',
    'P1STEMDISC',
    'P2STEMDISC',
    'X2TXMSCR',
    'X2X1TXMSCR',
    'X1SEX',
    'X1RACE',
    'X1SES_U',
    'X3THIMATH9',
    'X1TXMSCR',
    'W1PARENT'
]

low_teacher_support = [
    ('S2MTCHTREAT','treats some kids better'),
    ('S2MTCHINTRST','makes math interesting'),
    ('S2MTCHEASY','makes math easy to understand'),
    ('S2MTCHTHINK','wants students to think'),
    ('S2MTCHGIVEUP','doesnt let students give up')
]

ability_self_concept = [
    ('S1MTESTS','confident can do excellent job on test'),
    ('S1MTEXTBOOK','certain can understand math textbook'),
    ('S1MSKILLS','certain can master math skills'),
    ('S1MASSEXCL','confident can do excellent job on assignments')
]

parental_support = [
    ('P1MUSEUM','went to science or engineering museum'),
    ('P2MUSEUM','went to science or engineering museum'),
    ('P1COMPUTER','worked or played on computer'),
    ('P2COMPUTER','worked or played on computer'),
    ('P1FIXED','built or fixed something'),
    ('P2FIXED','built or fixed something'),
    ('P1LIBRARY','visited a library'),
    ('P2LIBRARY','visited a library'),
    ('P1STEMDISC','discussed STEM program or article'),
    ('P2STEMDISC','discussed STEM program or article'),]

covariates = [
    'sex',
    'race',
    'SES',
    'base_year_score',
    'highest_level_math'
]

math_acheivement_score = [('X2TXMSCR', 'score')]

highest_level_math = [('X3THIMATH9', 'level')]

base_year_score = [('X1TXMSCR', 'base_score')]

Students’ demographic information including their gender, race/ethnicity, 9th grade math achievement using the IRT-estimated
score (i.e., a criterion-referenced measure of achievement on algebraic reasoning assessment which was similarly constructed and administered as the assessment in 11th grade), and socioeconomic status (i.e., a composite measure of parents’ education, occupation, and family income) collected in 9th grade were included in the analyses as covariates. Also, students’ highest-level math course taken in 9th grade (1 = Basic math, 13 = AP/IB calculus) from the high school transcript was included.

In [181]:
# .copy() gives an independent frame, so the column assignments below are safe
student_survey = student_survey[all_columns].copy()

# Sex: 1 -> 0 and 2 -> 1 (labelled Male / Female by SEX_MAP below)
student_survey['sex'] = student_survey['X1SEX']
student_survey.loc[student_survey["sex"] == 1, "sex"] = 0
student_survey.loc[student_survey["sex"] == 2, "sex"] = 1
student_survey['sex'] = student_survey['sex'].astype(np.int8)

# Race: recode via temporary codes 10-15 so new values never collide with
# source codes still waiting to be recoded; subtracting 10 then gives
# 0 = White, 1 = Black, 2 = Hispanic, 3 = Asian, 4-5 = Other (see RACE_MAP below)
student_survey['race'] = student_survey['X1RACE']
student_survey.loc[student_survey["race"] == 8, "race"] = 10
student_survey.loc[student_survey["race"] == 3, "race"] = 11
student_survey.loc[student_survey["race"] == 4, "race"] = 12
student_survey.loc[student_survey["race"] == 5, "race"] = 12
student_survey.loc[student_survey["race"] == 2, "race"] = 13
student_survey.loc[student_survey["race"] == 6, "race"] = 14
student_survey.loc[student_survey["race"] == 7, "race"] = 15
student_survey.loc[student_survey["race"] == 1, "race"] = 15
student_survey['race'] = student_survey['race'] - 10
student_survey['race'] = student_survey['race'].astype(np.int8)

# SES is kept as the continuous X1SES_U composite (quartile binning left disabled)
student_survey['SES'] = student_survey['X1SES_U']
# pd.qcut(student_survey['X1SES_U'],4,labels=np.arange(4) + 1)
# student_survey['SES'] = student_survey['SES'].astype(np.int8)
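
The same recoding can be expressed as a lookup table. The sketch below is an alternative on toy data, not what the notebook ran: `RACE_RECODE`, `SEX_RECODE`, and `recode` are illustrative names, and the mappings simply mirror the chained `.loc` assignments above. One behavioural difference is deliberate: codes that are not in the mapping (for example a reserve code such as -9) become missing instead of being shifted by the final subtraction.

```python
import pandas as pd

# Source code -> analysis code, mirroring the chained .loc assignments above.
RACE_RECODE = {8: 0, 3: 1, 4: 2, 5: 2, 2: 3, 6: 4, 7: 5, 1: 5}
SEX_RECODE = {1: 0, 2: 1}

def recode(series, mapping):
    """Map codes through `mapping`; codes not in the mapping become <NA>."""
    return series.map(mapping).astype("Int8")

# Toy inputs (hypothetical values, including one reserve code).
toy_race = pd.Series([8, 3, 5, 2, 6, 1, -9])
toy_sex = pd.Series([1, 2, 2])
race_out = recode(toy_race, RACE_RECODE)
sex_out = recode(toy_sex, SEX_RECODE)
print(race_out.tolist())
print(sex_out.tolist())
```

The nullable `Int8` dtype is used so that missing codes can coexist with small integers; a plain `np.int8` cast would fail on missing values.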

In [184]:
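weights = student_survey['W1PARENT']
# drop() returns a new DataFrame; the result is only displayed here, so
# student_survey itself keeps W1PARENT (it still appears in the cells below).
student_survey.drop(['W1PARENT'], axis=1)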

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2TXMSCR,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,sex,race,SES
0,1,1,2,2,2,2,3,1,2,2,...,99.1403,69.4994,1,8,1.6907,6,50.4919,0,0,1.6907
1,1,1,3,3,3,2,2,2,2,2,...,72.4904,49.4710,2,8,-0.3923,2,35.8045,1,0,-0.3923
2,1,1,3,1,1,2,1,1,3,2,...,75.4243,77.3584,2,3,1.1271,5,56.0477,1,1,1.1271
6,1,1,2,4,4,2,3,2,2,2,...,60.0290,52.7841,2,8,-0.4774,3,38.4063,1,0,-0.4774
9,1,1,2,3,2,2,3,1,1,1,...,86.8286,76.6330,2,8,0.1081,4,55.5463,1,0,0.1081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,1,1,3,1,1,1,1,2,2,2,...,85.8476,70.7533,1,8,1.2713,-9,51.3889,0,0,1.2713
23499,1,1,3,3,2,1,1,2,1,2,...,60.8373,46.4682,2,5,-1.3350,3,33.3823,1,2,-1.3350
23500,1,1,1,1,1,1,1,2,3,2,...,60.9564,53.2958,2,8,-0.0031,4,38.8000,1,0,-0.0031
23501,1,1,2,4,4,3,4,2,2,2,...,65.1187,72.6301,1,8,0.7236,4,52.7281,0,0,0.7236


In [185]:
student_survey = student_survey.apply(pd.to_numeric, errors = 'coerce')
student_survey

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,W1PARENT,sex,race,SES
0,1,1,2,2,2,2,3,1,2,2,...,69.4994,1,8,1.6907,6,50.4919,470.250141,0,0,1.6907
1,1,1,3,3,3,2,2,2,2,2,...,49.4710,2,8,-0.3923,2,35.8045,224.455466,1,0,-0.3923
2,1,1,3,1,1,2,1,1,3,2,...,77.3584,2,3,1.1271,5,56.0477,185.301339,1,1,1.1271
6,1,1,2,4,4,2,3,2,2,2,...,52.7841,2,8,-0.4774,3,38.4063,379.440827,1,0,-0.4774
9,1,1,2,3,2,2,3,1,1,1,...,76.6330,2,8,0.1081,4,55.5463,242.626125,1,0,0.1081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,1,1,3,1,1,1,1,2,2,2,...,70.7533,1,8,1.2713,-9,51.3889,73.287998,0,0,1.2713
23499,1,1,3,3,2,1,1,2,1,2,...,46.4682,2,5,-1.3350,3,33.3823,10.120169,1,2,-1.3350
23500,1,1,1,1,1,1,1,2,3,2,...,53.2958,2,8,-0.0031,4,38.8000,98.823515,1,0,-0.0031
23501,1,1,2,4,4,3,4,2,2,2,...,72.6301,1,8,0.7236,4,52.7281,262.402860,0,0,0.7236


### Imputation
The authors state that they use "multiple imputation" to handle missing data, but beyond naming the Stata package they do not say which procedure they used. We therefore impute in Python with miceforest, a multiple-imputation library that implements the MICE algorithm with LightGBM models.

In [186]:
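# Negative values are reserve (missing-data) codes. SES is excluded because its
# negative values are valid; X1SES_U is not excluded, so it is blanked as well.
student_survey = student_survey.copy()
non_ses = student_survey.columns.drop('SES')
student_survey[non_ses] = student_survey[non_ses].mask(student_survey[non_ses] < 0)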



In [187]:
student_survey

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,W1PARENT,sex,race,SES
0,1,1,2.0,2.0,2.0,2.0,3.0,1.0,2.0,2.0,...,69.4994,1,8,1.6907,6.0,50.4919,470.250141,0,0,1.6907
1,1,1,3.0,3.0,3.0,2.0,2.0,2.0,2.0,2.0,...,49.4710,2,8,,2.0,35.8045,224.455466,1,0,-0.3923
2,1,1,3.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,...,77.3584,2,3,1.1271,5.0,56.0477,185.301339,1,1,1.1271
6,1,1,2.0,4.0,4.0,2.0,3.0,2.0,2.0,2.0,...,52.7841,2,8,,3.0,38.4063,379.440827,1,0,-0.4774
9,1,1,2.0,3.0,2.0,2.0,3.0,1.0,1.0,1.0,...,76.6330,2,8,0.1081,4.0,55.5463,242.626125,1,0,0.1081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,1,1,3.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,...,70.7533,1,8,1.2713,,51.3889,73.287998,0,0,1.2713
23499,1,1,3.0,3.0,2.0,1.0,1.0,2.0,1.0,2.0,...,46.4682,2,5,,3.0,33.3823,10.120169,1,2,-1.3350
23500,1,1,1.0,1.0,1.0,1.0,1.0,2.0,3.0,2.0,...,53.2958,2,8,,4.0,38.8000,98.823515,1,0,-0.0031
23501,1,1,2.0,4.0,4.0,3.0,4.0,2.0,2.0,2.0,...,72.6301,1,8,0.7236,4.0,52.7281,262.402860,0,0,0.7236


In [188]:
# Using pip
# ! pip install miceforest --no-cache-dir

In [189]:
# Create kernel. 
kds = mf.ImputationKernel(
  student_survey,
  datasets=1,
  save_all_iterations=False,
  random_state=42
)

# Run the MICE algorithm for 2 iterations
kds.mice(2)
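
The kernel above produces a single completed dataset (`datasets=1`). When several completed datasets are generated, the per-dataset estimates are usually pooled with Rubin's rules. The sketch below shows that pooling step for one coefficient; the numbers are hypothetical and `pool_rubin` is an illustrative helper, not part of miceforest.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()
    within = variances.mean()          # average sampling variance
    between = estimates.var(ddof=1)    # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total)

# Hypothetical coefficient and squared standard error from m = 3 fits.
coef, se = pool_rubin([2.0, 2.2, 1.8], [0.04, 0.05, 0.03])
print(coef, se)
```

The between-imputation term is what a single completed dataset cannot provide, which is why standard errors from one imputed dataset tend to be too small.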



In [190]:
completed_dataset = kds.complete_data(dataset=0, inplace=False)
student_survey.describe() - completed_dataset.describe()

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,W1PARENT,sex,race,SES
count,0.0,0.0,-243.0,-243.0,-252.0,-257.0,-271.0,-67.0,-98.0,-114.0,...,0.0,0.0,0.0,-6624.0,-1057.0,0.0,0.0,0.0,0.0,0.0
mean,0.0,0.0,0.000336,-0.001188,-7.6e-05,-0.001923,-0.001205,-0.00021,-0.000826,-2.4e-05,...,0.0,0.0,0.0,0.317936,0.002861,0.0,0.0,0.0,0.0,0.0
std,0.0,0.0,-0.000195,0.000774,-2.9e-05,-0.000579,0.000269,0.000669,0.000406,0.000537,...,0.0,0.0,0.0,0.011837,-0.004274,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25725,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.517,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.38405,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [191]:
#student_survey = student_survey[(student_survey[all_columns] >= 0).all(axis=1)]
student_survey

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,W1PARENT,sex,race,SES
0,1,1,2.0,2.0,2.0,2.0,3.0,1.0,2.0,2.0,...,69.4994,1,8,1.6907,6.0,50.4919,470.250141,0,0,1.6907
1,1,1,3.0,3.0,3.0,2.0,2.0,2.0,2.0,2.0,...,49.4710,2,8,,2.0,35.8045,224.455466,1,0,-0.3923
2,1,1,3.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,...,77.3584,2,3,1.1271,5.0,56.0477,185.301339,1,1,1.1271
6,1,1,2.0,4.0,4.0,2.0,3.0,2.0,2.0,2.0,...,52.7841,2,8,,3.0,38.4063,379.440827,1,0,-0.4774
9,1,1,2.0,3.0,2.0,2.0,3.0,1.0,1.0,1.0,...,76.6330,2,8,0.1081,4.0,55.5463,242.626125,1,0,0.1081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,1,1,3.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,...,70.7533,1,8,1.2713,,51.3889,73.287998,0,0,1.2713
23499,1,1,3.0,3.0,2.0,1.0,1.0,2.0,1.0,2.0,...,46.4682,2,5,,3.0,33.3823,10.120169,1,2,-1.3350
23500,1,1,1.0,1.0,1.0,1.0,1.0,2.0,3.0,2.0,...,53.2958,2,8,,4.0,38.8000,98.823515,1,0,-0.0031
23501,1,1,2.0,4.0,4.0,3.0,4.0,2.0,2.0,2.0,...,72.6301,1,8,0.7236,4.0,52.7281,262.402860,0,0,0.7236


### Combine columns into predictors

In [192]:
low = [i[0] for i in low_teacher_support]
student_df = completed_dataset[low].dropna()
# Composite score: mean of the five teacher items for each student
teacher_var = student_df.mean(axis=1).to_frame('teacher')
teacher_var

Unnamed: 0,teacher
0,2.2
1,2.6
2,1.6
6,3.0
9,2.4
...,...
23497,1.4
23499,2.0
23500,1.0
23501,3.4


In [193]:
# The scale was reverse-coded so that high scores signified strong math ability self-concepts (1 = Strongly disagree, 4 = Strongly agree).
reverse_code = {
    'S1MTESTS': {1: 4, 2: 3, 3: 2, 4: 1},
    'S1MTEXTBOOK': {1: 4, 2: 3, 3: 2, 4: 1},
    'S1MSKILLS': {1: 4, 2: 3, 3: 2, 4: 1},
    'S1MASSEXCL': {1: 4, 2: 3, 3: 2, 4: 1},
}

ability = [i[0] for i in ability_self_concept]
ability_df = completed_dataset[ability].dropna()
ability_df = ability_df.replace(reverse_code)
ability_var = (ability_df.sum(axis=1)/len(ability)).to_frame()
ability_var.columns = ['ability']
ability_var.describe()

Unnamed: 0,ability
count,14575.0
mean,2.975146
std,0.643018
min,1.0
25%,2.75
50%,3.0
75%,3.5
max,4.0
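
For a scale that runs from 1 to 4, the reverse-coding dictionary above is equivalent to subtracting each answer from 5. The toy sketch below (made-up answers, illustrative column names and helper) shows that shortcut together with the composite score taken as the row mean.

```python
import pandas as pd

def reverse_code_likert(df, low=1, high=4):
    """Flip a Likert scale so that `low` becomes `high` and vice versa."""
    return (low + high) - df

# Hypothetical answers from three students on two items.
toy = pd.DataFrame({
    "item_a": [1, 2, 4],
    "item_b": [1, 3, 4],
})
flipped = reverse_code_likert(toy)
composite = flipped.mean(axis=1)  # same as sum(axis=1) / number of items
print(flipped)
print(composite.tolist())
```

The arithmetic form only works while every value lies on the scale, so it should run after reserve codes have been removed or imputed.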


In [194]:
parent = [i[0] for i in parental_support]
parent_df = completed_dataset[parent]
parent_df

parental_var = (parent_df.sum(axis=1)/len(parent)).to_frame()
parental_var.columns = ['parents']
parental_var.describe()

Unnamed: 0,parents
count,14575.0
mean,0.614772
std,0.235764
min,0.0
25%,0.5
50%,0.6
75%,0.8
max,1.0


In [195]:
level = [i[0] for i in highest_level_math]
level_var = completed_dataset[level]
level_var.columns = ['base_level']
level_var.describe()

Unnamed: 0,base_level
count,14575.0
mean,4.364871
std,1.464075
min,0.0
25%,4.0
50%,4.0
75%,5.0
max,11.0


In [196]:
acheive = [i[0] for i in math_acheivement_score]
# rename() returns a new frame, which avoids the SettingWithCopyWarning
acheive_var = completed_dataset[acheive].rename(columns={'X2TXMSCR': 'math'})
acheive_var.describe()


Unnamed: 0,math
count,14575.0
mean,69.946709
std,18.729319
min,25.0057
25%,58.91525
50%,69.4297
75%,85.15615
max,115.1


In [197]:
base = [i[0] for i in base_year_score]
# rename() returns a new frame, which avoids the SettingWithCopyWarning
base_var = completed_dataset[base].rename(columns={'X1TXMSCR': 'base_math'})
base_var.describe()


Unnamed: 0,base_math
count,14575.0
mean,42.115971
std,11.610209
min,15.8641
25%,34.50865
50%,42.2088
75%,50.709
max,69.9317


In [198]:
completed_dataset['SES'].describe()

count    14575.000000
mean         0.148680
std          0.789495
min         -1.906800
25%         -0.427050
50%          0.092500
75%          0.676300
max          2.978300
Name: SES, dtype: float64

In [199]:
full_df = pd.concat([acheive_var, 
                     teacher_var, 
                     ability_var, 
                     parental_var, 
                     completed_dataset['sex'], 
                     completed_dataset['SES'], 
                     base_var,
                     level_var], axis=1)

In [200]:
full_df.corr()
# YAY! NOTE: probably using the wrong SES - try and figure out which one they actually used...

Unnamed: 0,math,teacher,ability,parents,sex,SES,base_math,base_level
math,1.0,-0.122527,0.299191,0.142562,-0.022433,0.414977,0.745042,0.358769
teacher,-0.122527,1.0,-0.13809,-0.034284,0.031998,-0.051738,-0.086148,-0.042563
ability,0.299191,-0.13809,1.0,0.082273,-0.105274,0.13077,0.302118,0.138293
parents,0.142562,-0.034284,0.082273,1.0,-0.055088,0.22597,0.13108,0.102144
sex,-0.022433,0.031998,-0.105274,-0.055088,1.0,-0.004499,-0.018885,0.022568
SES,0.414977,-0.051738,0.13077,0.22597,-0.004499,1.0,0.408821,0.229129
base_math,0.745042,-0.086148,0.302118,0.13108,-0.018885,0.408821,1.0,0.38301
base_level,0.358769,-0.042563,0.138293,0.102144,0.022568,0.229129,0.38301,1.0


In [201]:
RACE_MAP = {
    0: "White",
    1: "Black",
    2: "Hispanic",
    3: "Asian",
    4: "Other",
    5: "Other"
}

SEX_MAP = {
    0: "Male",
    1: "Female"
}

In [202]:
full_df_regression = full_df.copy()
full_df_regression['race'] = completed_dataset['race']
full_df_regression = full_df_regression.replace({'sex': SEX_MAP,
                                                 'race': RACE_MAP})

In [203]:
full_df_regression

Unnamed: 0,math,teacher,ability,parents,sex,SES,base_math,base_level,race
0,99.1403,2.2,3.50,0.6,Male,1.6907,50.4919,6.0,White
1,72.4904,2.6,3.25,0.0,Female,-0.3923,35.8045,2.0,White
2,75.4243,1.6,3.25,0.8,Female,1.1271,56.0477,5.0,Black
6,60.0290,3.0,3.00,0.8,Female,-0.4774,38.4063,3.0,White
9,86.8286,2.4,4.00,0.6,Female,0.1081,55.5463,4.0,White
...,...,...,...,...,...,...,...,...,...
23497,85.8476,1.4,3.00,0.8,Male,1.2713,51.3889,5.0,White
23499,60.8373,2.0,3.50,0.4,Female,-1.3350,33.3823,3.0,Hispanic
23500,60.9564,1.0,2.75,0.6,Female,-0.0031,38.8000,4.0,White
23501,65.1187,3.4,3.00,0.7,Male,0.7236,52.7281,4.0,White


In [214]:
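from statsmodels.regression.linear_model import WLS

# NOTE: WLS takes its weights through the `weights` argument. `freq_weights` is
# not a WLS parameter, so as written the survey weights are most likely ignored
# and the fit is effectively unweighted. Passing weights=weights.to_numpy()
# would apply them (and change the results shown below).
model_lts = WLS.from_formula(
    'math ~ teacher + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_lts = model_lts.fit(method='pinv')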
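
For intuition about what observation weights do in a fit like the one above, here is a NumPy sketch of weighted least squares on synthetic data. It is not the statsmodels implementation: `weighted_least_squares` is an illustrative helper that solves the weighted normal equations directly, and the weights are random numbers standing in for survey weights.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Solve min_b sum_i w_i * (y_i - x_i @ b)**2 via the normal equations."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # intercept + one predictor
y = 1.0 + 2.0 * x + rng.normal(size=n)
w = rng.uniform(0.5, 5.0, size=n)      # synthetic stand-in for survey weights

beta_weighted = weighted_least_squares(X, y, w)
beta_unweighted = weighted_least_squares(X, y, np.ones(n))
print(beta_weighted, beta_unweighted)
```

Comparing the two coefficient vectors is a quick check of whether weights are actually reaching a model: if a weighted and an unweighted fit return identical numbers, the weights are probably not being used.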



In [215]:
regression_lts.summary2()

0,1,2,3
Model:,WLS,Adj. R-squared:,0.581
Dependent Variable:,math,AIC:,114118.0253
Date:,2022-08-09 15:52,BIC:,114193.8959
No. Observations:,14575,Log-Likelihood:,-57049.0
Df Model:,9,F-statistic:,2243.0
Df Residuals:,14565,Prob (F-statistic):,0.0
R-squared:,0.581,Scale:,147.11

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,25.9244,0.6631,39.0945,0.0000,24.6246,27.2242
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.3672,0.2012,-1.8252,0.0680,-0.7615,0.0271
"C(race, Treatment(reference=""White""))[T.Asian]",3.3432,0.3771,8.8665,0.0000,2.6041,4.0822
"C(race, Treatment(reference=""White""))[T.Black]",-1.6849,0.3606,-4.6728,0.0000,-2.3916,-0.9781
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.0878,0.3008,-0.2918,0.7704,-0.6774,0.5019
"C(race, Treatment(reference=""White""))[T.Other]",-0.0681,0.3448,-0.1976,0.8434,-0.7441,0.6078
teacher,-1.9666,0.1922,-10.2295,0.0000,-2.3435,-1.5898
SES,2.9032,0.1435,20.2381,0.0000,2.6221,3.1844
base_math,1.0444,0.0103,101.7839,0.0000,1.0243,1.0645

0,1,2,3
Omnibus:,395.794,Durbin-Watson:,1.997
Prob(Omnibus):,0.0,Jarque-Bera (JB):,517.082
Skew:,-0.324,Prob(JB):,0.0
Kurtosis:,3.657,Condition No.:,300.0


In [216]:
model_sc = WLS.from_formula(
    'math ~ ability + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_sc = model_sc.fit(method='pinv')
regression_sc.summary2()



0,1,2,3
Model:,WLS,Adj. R-squared:,0.583
Dependent Variable:,math,AIC:,114022.0426
Date:,2022-08-09 15:52,BIC:,114097.9133
No. Observations:,14575,Log-Likelihood:,-57001.0
Df Model:,9,F-statistic:,2269.0
Df Residuals:,14565,Prob (F-statistic):,0.0
R-squared:,0.584,Scale:,146.14

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,15.9296,0.6033,26.4034,0.0000,14.7470,17.1122
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.1289,0.2015,-0.6398,0.5223,-0.5240,0.2661
"C(race, Treatment(reference=""White""))[T.Asian]",3.2972,0.3758,8.7737,0.0000,2.5606,4.0339
"C(race, Treatment(reference=""White""))[T.Black]",-2.0695,0.3610,-5.7322,0.0000,-2.7772,-1.3618
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.1308,0.2999,-0.4363,0.6627,-0.7186,0.4569
"C(race, Treatment(reference=""White""))[T.Other]",-0.1455,0.3438,-0.4232,0.6722,-0.8193,0.5284
ability,2.3462,0.1653,14.1975,0.0000,2.0223,2.6702
SES,2.9088,0.1430,20.3468,0.0000,2.6285,3.1890
base_math,1.0124,0.0106,95.8234,0.0000,0.9917,1.0331

0,1,2,3
Omnibus:,389.513,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,512.218
Skew:,-0.318,Prob(JB):,0.0
Kurtosis:,3.662,Condition No.:,273.0


In [222]:
model_sc = WLS.from_formula(
    'math ~ (ability * teacher * parents) + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_sc = model_sc.fit(method='pinv')
regression_sc.summary2()



0,1,2,3
Model:,WLS,Adj. R-squared:,0.586
Dependent Variable:,math,AIC:,113936.2602
Date:,2022-08-10 08:10,BIC:,114057.6532
No. Observations:,14575,Log-Likelihood:,-56952.0
Df Model:,15,F-statistic:,1376.0
Df Residuals:,14559,Prob (F-statistic):,0.0
R-squared:,0.586,Scale:,145.22

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,21.6437,5.4841,3.9466,0.0001,10.8941,32.3933
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.0454,0.2013,-0.2257,0.8214,-0.4400,0.3491
"C(race, Treatment(reference=""White""))[T.Asian]",3.2984,0.3759,8.7756,0.0000,2.5617,4.0352
"C(race, Treatment(reference=""White""))[T.Black]",-2.1470,0.3601,-5.9629,0.0000,-2.8528,-1.4413
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.1860,0.2991,-0.6219,0.5340,-0.7723,0.4003
"C(race, Treatment(reference=""White""))[T.Other]",-0.1571,0.3427,-0.4583,0.6467,-0.8289,0.5147
ability,1.4561,1.8147,0.8024,0.4223,-2.1009,5.0131
teacher,-1.5595,2.2873,-0.6818,0.4954,-6.0429,2.9240
ability:teacher,-0.0553,0.7677,-0.0721,0.9426,-1.5601,1.4495

0,1,2,3
Omnibus:,380.705,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,495.451
Skew:,-0.317,Prob(JB):,0.0
Kurtosis:,3.644,Condition No.:,5039.0


In [223]:
table_3 = {}
# Model 1: low teacher support
model_lts = WLS.from_formula(
    'math ~ teacher + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_lts = model_lts.fit(method='pinv')
table_3['model_1'] = regression_lts.summary2()

# Model 2: ability self-concept
model_sc = WLS.from_formula(
    'math ~ ability + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_sc = model_sc.fit(method='pinv')
table_3['model_2'] = regression_sc.summary2()

# Model 3: ability x teacher interaction
model_sc_lts = WLS.from_formula(
    'math ~ (ability * teacher) + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_sc_lts = model_sc_lts.fit(method='pinv')
table_3['model_3'] = regression_sc_lts.summary2()

# Model 4: ability x parents interaction. The formula is an assumption based on
# the variable name (ps = parental support, sc = self-concept).
model_ps_sc = WLS.from_formula(
    'math ~ (ability * parents) + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_ps_sc = model_ps_sc.fit(method='pinv')
table_3['model_4'] = regression_ps_sc.summary2()

# Model 5: full three-way interaction
model_all = WLS.from_formula(
    'math ~ (ability * teacher * parents) + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_all = model_all.fit(method='pinv')
table_3['model_5'] = regression_all.summary2()
table_3['model_5']



0,1,2,3
Model:,WLS,Adj. R-squared:,0.586
Dependent Variable:,math,AIC:,113936.2602
Date:,2022-08-10 08:46,BIC:,114057.6532
No. Observations:,14575,Log-Likelihood:,-56952.0
Df Model:,15,F-statistic:,1376.0
Df Residuals:,14559,Prob (F-statistic):,0.0
R-squared:,0.586,Scale:,145.22

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,21.6437,5.4841,3.9466,0.0001,10.8941,32.3933
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.0454,0.2013,-0.2257,0.8214,-0.4400,0.3491
"C(race, Treatment(reference=""White""))[T.Asian]",3.2984,0.3759,8.7756,0.0000,2.5617,4.0352
"C(race, Treatment(reference=""White""))[T.Black]",-2.1470,0.3601,-5.9629,0.0000,-2.8528,-1.4413
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.1860,0.2991,-0.6219,0.5340,-0.7723,0.4003
"C(race, Treatment(reference=""White""))[T.Other]",-0.1571,0.3427,-0.4583,0.6467,-0.8289,0.5147
ability,1.4561,1.8147,0.8024,0.4223,-2.1009,5.0131
teacher,-1.5595,2.2873,-0.6818,0.4954,-6.0429,2.9240
ability:teacher,-0.0553,0.7677,-0.0721,0.9426,-1.5601,1.4495

0,1,2,3
Omnibus:,380.705,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,495.451
Skew:,-0.317,Prob(JB):,0.0
Kurtosis:,3.644,Condition No.:,5039.0


In [230]:
regression_all.get_robustcov_results().summary()

0,1,2,3
Dep. Variable:,math,R-squared:,0.586
Model:,WLS,Adj. R-squared:,0.586
Method:,Least Squares,F-statistic:,1640.0
Date:,"Wed, 10 Aug 2022",Prob (F-statistic):,0.0
Time:,08:49:04,Log-Likelihood:,-56952.0
No. Observations:,14575,AIC:,113900.0
Df Residuals:,14559,BIC:,114100.0
Df Model:,15,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,21.6437,6.234,3.472,0.001,9.424,33.863
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.0454,0.202,-0.225,0.822,-0.441,0.350
"C(race, Treatment(reference=""White""))[T.Asian]",3.2984,0.363,9.092,0.000,2.587,4.010
"C(race, Treatment(reference=""White""))[T.Black]",-2.1470,0.369,-5.824,0.000,-2.870,-1.424
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.1860,0.309,-0.602,0.547,-0.791,0.419
"C(race, Treatment(reference=""White""))[T.Other]",-0.1571,0.339,-0.464,0.643,-0.821,0.507
ability,1.4561,2.034,0.716,0.474,-2.531,5.443
teacher,-1.5595,2.645,-0.590,0.556,-6.744,3.626
ability:teacher,-0.0553,0.874,-0.063,0.950,-1.769,1.658

0,1,2,3
Omnibus:,380.705,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,495.451
Skew:,-0.317,Prob(JB):,2.6e-108
Kurtosis:,3.644,Cond. No.,5040.0
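
The summary above reports covariance type HC1. As a rough illustration of what that estimator computes, the sketch below builds HC1 standard errors by hand for an ordinary (unweighted) least-squares fit on synthetic heteroskedastic data. `hc1_standard_errors` is an illustrative helper, not the statsmodels code path, and a weighted fit would apply the same sandwich formula to the weight-transformed data.

```python
import numpy as np

def hc1_standard_errors(X, y):
    """OLS coefficients with HC1 (heteroskedasticity-robust) standard errors."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: X' diag(resid^2) X
    meat = X.T @ (X * (resid ** 2)[:, None])
    cov = xtx_inv @ meat @ xtx_inv * n / (n - k)   # HC1 small-sample factor
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 3, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=0.2 + x, size=n)  # noise grows with x
beta, se = hc1_standard_errors(X, y)
print(beta, se)
```

The coefficients are unchanged by the robust option; only the standard errors, and therefore the t statistics and intervals, differ from the classical ones.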


In [238]:
cov = regression_all.cov_params()
#ind_col = list(cov.columns).index('ability:teacher:parents')

In [240]:
regression_all.cov_params()

Unnamed: 0,Intercept,"C(sex, Treatment(reference=""Male""))[T.Female]","C(race, Treatment(reference=""White""))[T.Asian]","C(race, Treatment(reference=""White""))[T.Black]","C(race, Treatment(reference=""White""))[T.Hispanic]","C(race, Treatment(reference=""White""))[T.Other]",ability,teacher,ability:teacher,parents,ability:parents,teacher:parents,ability:teacher:parents,SES,base_math,base_level
Intercept,30.075845,-0.037299,-0.014637,-0.024403,-0.04481,-0.011691,-9.657384,-12.15309,3.942653,-42.784317,13.766182,17.397072,-5.621527,0.025645,-0.003325,-0.003524
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.037299,0.040515,-0.000467,-0.000372,-0.000502,-0.000187,0.005872,0.003748,-0.001356,0.00316,0.000983,-0.000621,-2.9e-05,-0.000376,-3e-06,-0.000545
"C(race, Treatment(reference=""White""))[T.Asian]",-0.014637,-0.000467,0.141274,0.014219,0.015298,0.016581,0.00392,0.011199,-0.003132,0.047527,-0.012166,-0.018099,0.006243,-0.000628,-0.000478,-0.001265
"C(race, Treatment(reference=""White""))[T.Black]",-0.024403,-0.000372,0.014219,0.129643,0.021458,0.01887,-0.0062,0.00192,-0.000588,-0.010451,0.001408,0.001127,0.000553,0.00304,0.000586,-0.000203
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.04481,-0.000502,0.015298,0.021458,0.089462,0.019321,0.006633,0.008876,-0.003227,0.026354,-0.010848,-0.00922,0.004277,0.009079,0.000136,8e-06
"C(race, Treatment(reference=""White""))[T.Other]",-0.011691,-0.000187,0.016581,0.01887,0.019321,0.117477,-0.004551,-0.003222,0.001663,-0.022949,0.009161,0.010682,-0.004328,0.00297,4e-05,0.000416
ability,-9.657384,0.005872,0.00392,-0.0062,0.006633,-0.004551,3.293112,3.944285,-1.351433,13.764433,-4.664725,-5.625589,1.918598,-9e-06,-0.000155,-0.002508
teacher,-12.15309,0.003748,0.011199,0.00192,0.008876,-0.003222,3.944285,5.231814,-1.707254,17.40493,-5.628187,-7.50548,2.439437,-0.000643,0.000345,-0.002988
ability:teacher,3.942653,-0.001356,-0.003132,-0.000588,-0.003227,0.001663,-1.351433,-1.707254,0.589374,-5.621335,1.919044,2.438711,-0.8382,0.00014,-8.3e-05,0.000703
parents,-42.784317,0.00316,0.047527,-0.010451,0.026354,-0.022949,13.764433,17.40493,-5.621335,70.93961,-22.652729,-28.943122,9.282561,-0.01271,0.001582,-0.019886


In [243]:
cov.loc['ability:parents', 'parents']

-22.652728603169585
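
Covariance entries like the one above are what is needed to put a standard error on a conditional ("simple") slope in an interaction model, which is presumably why they are being looked up here. The sketch below shows the calculation for a two-way interaction with hypothetical numbers that are not taken from the fit above; `simple_slope` is an illustrative helper.

```python
import numpy as np

def simple_slope(b_main, b_inter, var_main, var_inter, cov_main_inter, z):
    """Slope of x at moderator value z in y = ... + b_main*x + b_inter*x*z."""
    slope = b_main + b_inter * z
    variance = var_main + z ** 2 * var_inter + 2 * z * cov_main_inter
    return slope, np.sqrt(variance)

# Hypothetical coefficients and covariance entries, for illustration only.
slope, se = simple_slope(b_main=1.5, b_inter=-0.5, var_main=0.25,
                         var_inter=0.04, cov_main_inter=-0.05, z=2.0)
print(slope, se)
```

The covariance term can be large and negative, as in the matrix above, so leaving it out can badly overstate the uncertainty of the conditional slope.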

In [244]:
! pip install smartnoise-synth

Collecting smartnoise-synth
  Using cached smartnoise_synth-0.2.6-py3-none-any.whl (53 kB)
Collecting opendp<0.5.0,>=0.4.0
  Using cached opendp-0.4.0-py3-none-any.whl (28.0 MB)
Collecting ctgan<0.5.0,>=0.4.3
  Using cached ctgan-0.4.3-py2.py3-none-any.whl (21 kB)
Collecting opacus<0.15.0,>=0.14.0
  Using cached opacus-0.14.0-py3-none-any.whl (114 kB)
Collecting torch<2,>=1.4
  Downloading torch-1.12.1-cp38-none-macosx_11_0_arm64.whl (49.1 MB)
Collecting torchvision<1,>=0.5.0
  Downloading torchvision-0.13.1-cp38-cp38-macosx_11_0_arm64.whl (1.2 MB)
Collecting pandas<1.1.5,>=1.1
  Downloading pandas-1.1.4.tar.gz (5.2 MB)