In [1]:
import pandas as pd
import numpy as np
import miceforest as mf
from math import nan
# Note: properly installing lightgbm allows you to run miceforest. If you have an M1 mac, please see:
# https://towardsdatascience.com/install-xgboost-and-lightgbm-on-apple-m1-macs-cb75180a2dda

In [2]:
student_survey = pd.read_csv('data/36423-0002-Data.tsv',sep='\t')

### Math Enrollment

S2MSPR12: S2 D13 Teenager taking math class(es) in spring 2012
[Are you currently/Were you] taking a math course [during the spring term of 2012?]
1=Yes
0=No

S1MFALL09: S1 C03 9th grader is taking a math course in the fall 2009 term

In [3]:
student_survey = student_survey[(student_survey['S2MSPR12'] == 1) & (student_survey['S1MFALL09'] == 1)]
len(student_survey)

14575

### Sample size rounded to the nearest ten in accordance with NCES regulations
The authors follow this convention throughout their paper.
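
As a small illustration of that rounding convention, the helper below rounds a count to the nearest ten. The function name is hypothetical, and rounding exact halves upward is an assumption made here; it is not taken from the paper or from NCES documentation.

```python
def round_to_nearest_ten(n: int) -> int:
    """Round a non-negative count to the nearest ten (halves round up, by assumption)."""
    return ((n + 5) // 10) * 10


# The filtered sample above has 14575 rows.
reported_n = round_to_nearest_ten(14575)
print(reported_n)  # 14580
```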

### Low teacher support 

S2MTCHTREAT: S2 D18A Teen's spring 2012 math teacher treats some kids better than others
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.
Your teacher...
[treats/treated] some kids better than other kids.
1=Strongly agree
2=Agree

S2MTCHINTRST: S2 D18B Teen's spring 2012 math teacher makes math interesting
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.
Your teacher...
[makes/made] math interesting.
1=Strongly agree
2=Agree
3=Disagree
4=Strongly disagree


S2MTCHEASY: S2 D18C Teen's spring 2012 math teacher makes math easy to understand
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.
Your teacher...
[makes/made] math easy to understand.
1=Strongly agree
2=Agree
3=Disagree
4=Strongly disagree

S2MTCHTHINK: S2 D18D Teen's spring 2012 math teacher wants students to think, not
memorize
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.
Your teacher...
[wants/wanted] students to think, not just memorize things.
1=Strongly agree
2=Agree
3=Disagree
4=Strongly disagree

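S2MTCHGIVEUP: S2 D18E Teen's spring 2012 math teacher doesn't let students give up
How much do you agree or disagree with the following statements about your teacher for [math course
title]? Remember, none of your teachers or your principal will see any of the answers you provide.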
Your teacher...
[doesn't/didn't] let people give up when the work [gets/got] hard.
1=Strongly agree
2=Agree
3=Disagree
4=Strongly disagree

### Ability self-concepts

S1MTESTS: S1 C08A 9th grader confident can do excellent job on fall 2009 math tests
How much do you agree or disagree with the following statements about your [fall 2009 math] course?
You are confident that you can do an excellent job on tests in this course
Strongly agree
Agree

S1MTEXTBOOK: S1 C08B 9th grader certain can understand fall 2009 math textbook
How much do you agree or disagree with the following statements about your [fall 2009 math] course?
You are certain that you can understand the most difficult material presented in the textbook used
in this course
Strongly agree
Agree
Disagree
Strongly disagree

S1MSKILLS: S1 C08C 9th grader certain can master skills in fall 2009 math course
How much do you agree or disagree with the following statements about your [fall 2009 math] course?
You are certain that you can master the skills being taught in this course
Strongly agree
Agree
Disagree
Strongly disagree

S1MASSEXCL: S1 C08D 9th grader confident can do excellent job on fall 2009 math
assignments
How much do you agree or disagree with the following statements about your [fall 2009 math] course?
You are confident that you can do an excellent job on assignments in this course
Strongly agree
Agree
Disagree
Strongly disagree

### Parental support

P1MUSEUM: P1 E07A Went to science or engineering museum with 9th grader in last year
P2MUSEUM: P2 B10A Visited science-related destination together in last year
| Value | Label | Unweighted frequency | % |
|---|---|---|---|
| 0 | No | 7195 | 30.6 % |
| 1 | Yes | 8253 | 35.1 % |
| Missing Data | | | |
| -9 | Missing | 1340 | 5.7 % |
| -8 | Unit non-response | 6715 | 28.6 % |

| Value | Label | Unweighted frequency | % |
|---|---|---|---|
| 0 | No | 4837 | 20.6 % |
| 1 | Yes | 3248 | 13.8 % |
| Missing Data | | | |
| -9 | Missing | 63 | 0.3 % |
| -8 | Unit non-response | 2603 | 11.1 % |
| -6 | Component not applicable | 12279 | 52.2 % |
| -4 | Item not administered: abbreviated interview | 473 | 2.0 % |
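
The negative values in these frequency tables are reserve codes for missing data rather than real answers. A minimal sketch of how such codes can be masked before analysis is shown below; the toy values are made up, and only the column names come from the codebook.

```python
import pandas as pd

# Toy stand-in for two survey columns; the values are invented for illustration.
toy = pd.DataFrame({
    "P1MUSEUM": [0, 1, -9, -8, 1],
    "P2MUSEUM": [1, -6, 0, -4, -9],
})

# Negative values are reserve codes, so replace them with NaN.
cleaned = toy.mask(toy < 0)

print(cleaned.isna().sum())  # number of masked cells per column
print(cleaned.mean())        # share of "Yes" among valid responses
```

Masking first matters because a mean taken over the raw codes would treat -9 or -8 as if they were answers.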

P1COMPUTER: P1 E07B Worked or played on computer with 9th grader in last year
P2COMPUTER: P2 B10B Worked or played on computer with teenager in last year

P1FIXED: P1 E07C Built or fixed something with 9th grader in last year
P2FIXED: P2 B10C Built or fixed something with teenager in last year

P1LIBRARY: P1 E07G Visited a library with 9th grader in last year
P2LIBRARY: P2 B10F Visited a library with teenager in last year

P1STEMDISC: P1 E07F Discussed STEM program or article with 9th grader in last year
P2STEMDISC: P2 B10E Discussed STEM program or article with teenager in last year

### Math Achievement Score

X2TXMSCR: X2 Mathematics IRT-estimated number right score (of ## first follow-up items)
Based upon 20,594 valid cases out of 23,503 total cases.
• Mean: 67.2219
• Minimum: 25.0057
• Maximum: 115.1000
• Standard Deviation: 19.2183

X2X1TXMSCR: X2 Mathematics IRT-estimated number right score at time of base year (of 118
first follow-up items)

X3THIMATH9: X3 Highest level mathematics course taken - ninth grade
13 AP/IB Calculus 0 0.0 %
Missing Data
-9 Missing 770 3.3 %
-8 Unit non-response

In [4]:
all_columns = [
    'S2MSPR12',
    'S1MFALL09',
    'S2MTCHTREAT', 
    'S2MTCHINTRST',
    'S2MTCHEASY',
    'S2MTCHTHINK',
    'S2MTCHGIVEUP',
    'S1MTESTS',
    'S1MTEXTBOOK',
    'S1MSKILLS',
    'S1MASSEXCL',
    'P1MUSEUM',
    'P2MUSEUM',
    'P1COMPUTER',
    'P2COMPUTER',
    'P1FIXED',
    'P2FIXED',
    'P1LIBRARY',
    'P2LIBRARY',
    'P1STEMDISC',
    'P2STEMDISC',
    'X2TXMSCR',
    'X2X1TXMSCR',
    'X1SEX',
    'X1RACE',
    'X1SES_U',
    'X3THIMATH9',
    'X1TXMSCR',
    'W1PARENT'
]

low_teacher_support = [
    ('S2MTCHTREAT','treats some kids better'),
    ('S2MTCHINTRST','makes math interesting'),
    ('S2MTCHEASY','makes math easy to understand'),
    ('S2MTCHTHINK','wants students to think'),
    ('S2MTCHGIVEUP','doesnt let students give up')
]

ability_self_concept = [
    ('S1MTESTS','confident can do excellent job on test'),
    ('S1MTEXTBOOK','certain can understand math textbook'),
    ('S1MSKILLS','certain can master math skills'),
    ('S1MASSEXCL','confident can do excellent job on assignments')
]

parental_support = [
    ('P1MUSEUM','went to science or engineering museum'),
    ('P2MUSEUM','went to science or engineering museum'),
    ('P1COMPUTER','worked or played on computer'),
    ('P2COMPUTER','worked or played on computer'),
    ('P1FIXED','built or fixed something'),
    ('P2FIXED','built or fixed something'),
    ('P1LIBRARY','visited a library'),
    ('P2LIBRARY','visited a library'),
    ('P1STEMDISC','discussed STEM program or article'),
    ('P2STEMDISC','discussed STEM program or article'),]

covariates = [
    'sex',
    'race',
    'SES',
    'base_year_score',
    'highest_level_math'
]

math_acheivement_score = [('X2TXMSCR', 'score')]

highest_level_math = [('X3THIMATH9', 'level')]

base_year_score = [('X1TXMSCR', 'base_score')]

Students’ demographic information including their gender, race/ethnicity, 9th-grade math achievement using the IRT-estimated score (i.e., a criterion-referenced measure of achievement on algebraic reasoning assessment which was similarly constructed and administered as the assessment in 11th grade), and socioeconomic status (i.e., a composite measure of parents’ education, occupation, and family income) collected in 9th grade were included in the analyses as covariates. Also, students’ highest-level math course taken in 9th grade (1 = Basic math, 13 = AP/IB calculus) from the high school transcript was included.

In [5]:
student_survey = student_survey[all_columns]

student_survey['sex'] = student_survey['X1SEX']
student_survey.loc[student_survey["sex"] == 1, "sex"] = 0
student_survey.loc[student_survey["sex"] == 2, "sex"] = 1
student_survey['sex'] = student_survey['sex'].astype(np.int8)

# Recode X1RACE to 0 = White, 1 = Black, 2 = Hispanic, 3 = Asian, 4-5 = Other (see RACE_MAP below).
# Targets are temporarily offset by 10 so that a recoded value can never match a later rule.
student_survey['race'] = student_survey['X1RACE']
student_survey.loc[student_survey["race"] == 8, "race"] = 10
student_survey.loc[student_survey["race"] == 3, "race"] = 11
student_survey.loc[student_survey["race"] == 4, "race"] = 12
student_survey.loc[student_survey["race"] == 5, "race"] = 12
student_survey.loc[student_survey["race"] == 2, "race"] = 13
student_survey.loc[student_survey["race"] == 6, "race"] = 14
student_survey.loc[student_survey["race"] == 7, "race"] = 15
student_survey.loc[student_survey["race"] == 1, "race"] = 15
student_survey['race'] = student_survey['race'] - 10
student_survey['race'] = student_survey['race'].astype(np.int8)

student_survey['SES'] = student_survey['X1SES_U']
# pd.qcut(student_survey['X1SES_U'],4,labels=np.arange(4) + 1)
# student_survey['SES'] = student_survey['SES'].astype(np.int8)
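
The step-by-step race recode in this cell can also be written as a single dictionary lookup. The sketch below uses toy values; the mapping is read off the assignments in the cell, and the name `RACE_RECODE` is hypothetical.

```python
import pandas as pd

# Raw X1RACE code -> analysis code, read off the assignments in the cell above.
RACE_RECODE = {8: 0, 3: 1, 4: 2, 5: 2, 2: 3, 6: 4, 7: 5, 1: 5}

raw = pd.Series([8, 3, 4, 5, 2, 6, 7, 1, -9], name="X1RACE")  # toy values

# Codes that are not in the dictionary (e.g. the -9 reserve code) become NaN.
race = raw.map(RACE_RECODE)
print(race.tolist())
```

One difference from the cell is worth keeping in mind: with `map`, reserve codes turn into NaN immediately, so a cast to `np.int8` would have to wait until missing values have been handled.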

In [6]:
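weights = student_survey['W1PARENT']
# drop() returns a new DataFrame and the result is not assigned, so this only displays
# the frame without W1PARENT; the column itself remains in student_survey.
student_survey.drop(['W1PARENT'], axis=1)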

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2TXMSCR,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,sex,race,SES
0,1,1,2,2,2,2,3,1,2,2,...,99.1403,69.4994,1,8,1.6907,6,50.4919,0,0,1.6907
1,1,1,3,3,3,2,2,2,2,2,...,72.4904,49.4710,2,8,-0.3923,2,35.8045,1,0,-0.3923
2,1,1,3,1,1,2,1,1,3,2,...,75.4243,77.3584,2,3,1.1271,5,56.0477,1,1,1.1271
6,1,1,2,4,4,2,3,2,2,2,...,60.0290,52.7841,2,8,-0.4774,3,38.4063,1,0,-0.4774
9,1,1,2,3,2,2,3,1,1,1,...,86.8286,76.6330,2,8,0.1081,4,55.5463,1,0,0.1081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,1,1,3,1,1,1,1,2,2,2,...,85.8476,70.7533,1,8,1.2713,-9,51.3889,0,0,1.2713
23499,1,1,3,3,2,1,1,2,1,2,...,60.8373,46.4682,2,5,-1.3350,3,33.3823,1,2,-1.3350
23500,1,1,1,1,1,1,1,2,3,2,...,60.9564,53.2958,2,8,-0.0031,4,38.8000,1,0,-0.0031
23501,1,1,2,4,4,3,4,2,2,2,...,65.1187,72.6301,1,8,0.7236,4,52.7281,0,0,0.7236


In [7]:
student_survey = student_survey.apply(pd.to_numeric, errors = 'coerce')
student_survey

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,W1PARENT,sex,race,SES
0,1,1,2,2,2,2,3,1,2,2,...,69.4994,1,8,1.6907,6,50.4919,470.250141,0,0,1.6907
1,1,1,3,3,3,2,2,2,2,2,...,49.4710,2,8,-0.3923,2,35.8045,224.455466,1,0,-0.3923
2,1,1,3,1,1,2,1,1,3,2,...,77.3584,2,3,1.1271,5,56.0477,185.301339,1,1,1.1271
6,1,1,2,4,4,2,3,2,2,2,...,52.7841,2,8,-0.4774,3,38.4063,379.440827,1,0,-0.4774
9,1,1,2,3,2,2,3,1,1,1,...,76.6330,2,8,0.1081,4,55.5463,242.626125,1,0,0.1081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,1,1,3,1,1,1,1,2,2,2,...,70.7533,1,8,1.2713,-9,51.3889,73.287998,0,0,1.2713
23499,1,1,3,3,2,1,1,2,1,2,...,46.4682,2,5,-1.3350,3,33.3823,10.120169,1,2,-1.3350
23500,1,1,1,1,1,1,1,2,3,2,...,53.2958,2,8,-0.0031,4,38.8000,98.823515,1,0,-0.0031
23501,1,1,2,4,4,3,4,2,2,2,...,72.6301,1,8,0.7236,4,52.7281,262.402860,0,0,0.7236


### Imputation
The authors state that they use "multiple imputation" procedures to impute missing data. Because they do not specify which procedure they use (beyond saying that they used Stata), we follow what we take to be best practice in Python and use miceforest, a multiple-imputation library that implements the MICE algorithm with LightGBM models.
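
To make the idea of chained equations concrete, here is a toy sketch in NumPy. It is not what miceforest does internally: it swaps in an ordinary least-squares model per column and adds a randomly drawn observed residual to each prediction, and the name `toy_mice` is hypothetical.

```python
import numpy as np


def toy_mice(X, n_iter=5, seed=0):
    """Chained-equations imputation with one linear model per column (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: initialise every missing cell with its column mean.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    # Step 2: cycle through the columns, re-imputing each one from the others.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(filled)), others])
            # Fit only on rows where column j was actually observed.
            beta, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            resid = filled[~rows, j] - A[~rows] @ beta
            # Prediction plus a random observed residual, to keep some noise.
            filled[rows, j] = A[rows] @ beta + rng.choice(resid, size=rows.sum())
    return filled


# Toy data: the second column depends on the first; some cells are removed.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
demo = np.column_stack([x1, x2])
demo[:20, 1] = np.nan

completed = toy_mice(demo)
print(np.isnan(completed).sum())  # no missing cells remain
```

Multiple imputation would repeat this with different seeds and pool the analyses; the sketch produces a single completed dataset, as the cells below do with `datasets=1`.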

In [8]:
# Work on an explicit copy so that the assignment below does not raise SettingWithCopyWarning.
temp = student_survey.loc[:, student_survey.columns != 'SES'].copy()
# Negative values are missing-data codes (-9, -8, ...); set them to NaN in every column except SES.
# Note that X1SES_U is not excluded here, so its legitimately negative values also become NaN.
temp[temp < 0] = nan
student_survey.loc[:, student_survey.columns != 'SES'] = temp


In [9]:
student_survey

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,W1PARENT,sex,race,SES
0,1,1,2.0,2.0,2.0,2.0,3.0,1.0,2.0,2.0,...,69.4994,1,8,1.6907,6.0,50.4919,470.250141,0,0,1.6907
1,1,1,3.0,3.0,3.0,2.0,2.0,2.0,2.0,2.0,...,49.4710,2,8,,2.0,35.8045,224.455466,1,0,-0.3923
2,1,1,3.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,...,77.3584,2,3,1.1271,5.0,56.0477,185.301339,1,1,1.1271
6,1,1,2.0,4.0,4.0,2.0,3.0,2.0,2.0,2.0,...,52.7841,2,8,,3.0,38.4063,379.440827,1,0,-0.4774
9,1,1,2.0,3.0,2.0,2.0,3.0,1.0,1.0,1.0,...,76.6330,2,8,0.1081,4.0,55.5463,242.626125,1,0,0.1081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,1,1,3.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,...,70.7533,1,8,1.2713,,51.3889,73.287998,0,0,1.2713
23499,1,1,3.0,3.0,2.0,1.0,1.0,2.0,1.0,2.0,...,46.4682,2,5,,3.0,33.3823,10.120169,1,2,-1.3350
23500,1,1,1.0,1.0,1.0,1.0,1.0,2.0,3.0,2.0,...,53.2958,2,8,,4.0,38.8000,98.823515,1,0,-0.0031
23501,1,1,2.0,4.0,4.0,3.0,4.0,2.0,2.0,2.0,...,72.6301,1,8,0.7236,4.0,52.7281,262.402860,0,0,0.7236


In [10]:
# Using pip
# ! pip install miceforest --no-cache-dir

In [11]:
# Create kernel. 
kds = mf.ImputationKernel(
  student_survey,
  datasets=1,
  save_all_iterations=False,
  random_state=42
)

# Run the MICE algorithm for 2 iterations
kds.mice(2)



In [12]:
completed_dataset = kds.complete_data(dataset=0, inplace=False)
student_survey.describe() - completed_dataset.describe()

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,W1PARENT,sex,race,SES
count,0.0,0.0,-243.0,-243.0,-252.0,-257.0,-271.0,-67.0,-98.0,-114.0,...,0.0,0.0,0.0,-6624.0,-1057.0,0.0,0.0,0.0,0.0,0.0
mean,0.0,0.0,-7.6e-05,-0.000708,0.000404,-0.001031,-0.001342,-0.000416,-0.000757,-0.000161,...,0.0,0.0,0.0,0.317923,-0.000158,0.0,0.0,0.0,0.0,0.0
std,0.0,0.0,0.000428,0.001163,0.000924,0.000662,-5e-05,0.000525,0.000343,0.000242,...,0.0,0.0,0.0,0.011863,0.001131,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25685,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.518,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.38405,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
#student_survey = student_survey[(student_survey[all_columns] >= 0).all(axis=1)]
student_survey

Unnamed: 0,S2MSPR12,S1MFALL09,S2MTCHTREAT,S2MTCHINTRST,S2MTCHEASY,S2MTCHTHINK,S2MTCHGIVEUP,S1MTESTS,S1MTEXTBOOK,S1MSKILLS,...,X2X1TXMSCR,X1SEX,X1RACE,X1SES_U,X3THIMATH9,X1TXMSCR,W1PARENT,sex,race,SES
0,1,1,2.0,2.0,2.0,2.0,3.0,1.0,2.0,2.0,...,69.4994,1,8,1.6907,6.0,50.4919,470.250141,0,0,1.6907
1,1,1,3.0,3.0,3.0,2.0,2.0,2.0,2.0,2.0,...,49.4710,2,8,,2.0,35.8045,224.455466,1,0,-0.3923
2,1,1,3.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,...,77.3584,2,3,1.1271,5.0,56.0477,185.301339,1,1,1.1271
6,1,1,2.0,4.0,4.0,2.0,3.0,2.0,2.0,2.0,...,52.7841,2,8,,3.0,38.4063,379.440827,1,0,-0.4774
9,1,1,2.0,3.0,2.0,2.0,3.0,1.0,1.0,1.0,...,76.6330,2,8,0.1081,4.0,55.5463,242.626125,1,0,0.1081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23497,1,1,3.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,...,70.7533,1,8,1.2713,,51.3889,73.287998,0,0,1.2713
23499,1,1,3.0,3.0,2.0,1.0,1.0,2.0,1.0,2.0,...,46.4682,2,5,,3.0,33.3823,10.120169,1,2,-1.3350
23500,1,1,1.0,1.0,1.0,1.0,1.0,2.0,3.0,2.0,...,53.2958,2,8,,4.0,38.8000,98.823515,1,0,-0.0031
23501,1,1,2.0,4.0,4.0,3.0,4.0,2.0,2.0,2.0,...,72.6301,1,8,0.7236,4.0,52.7281,262.402860,0,0,0.7236


### Combine columns into predictors

In [14]:
low = [i[0] for i in low_teacher_support]
student_df = completed_dataset[low].dropna()
# Composite score: mean of the five teacher items, left in their original coding
teacher_var = (student_df.sum(axis=1)/len(low)).to_frame()
teacher_var.columns = ['teacher']
teacher_var

Unnamed: 0,teacher
0,2.2
1,2.6
2,1.6
6,3.0
9,2.4
...,...
23497,1.4
23499,2.0
23500,1.0
23501,3.4
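
The composite above is a row-wise average of the item responses. A minimal sketch with made-up values shows that dividing the row sum by the number of items and calling `mean(axis=1)` agree when no values are missing; the item names are hypothetical.

```python
import pandas as pd

# Toy responses on a 1-4 agreement scale (invented values).
items = pd.DataFrame({
    "item_a": [2, 3, 1],
    "item_b": [2, 3, 4],
    "item_c": [3, 2, 1],
})

by_sum = items.sum(axis=1) / items.shape[1]  # row sum divided by number of items
by_mean = items.mean(axis=1)                 # same result when nothing is missing

print(by_mean.tolist())
```

The two diverge once NaN values are present, because `mean` skips them by default while the sum-based version still divides by the full item count.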


In [15]:
# The scale was reverse-coded so that high scores signified strong math ability self-concepts (1 = Strongly disagree, 4 = Strongly agree).
reverse_code = {
    'S1MTESTS': {1: 4, 2: 3, 3: 2, 4: 1},
    'S1MTEXTBOOK': {1: 4, 2: 3, 3: 2, 4: 1},
    'S1MSKILLS': {1: 4, 2: 3, 3: 2, 4: 1},
    'S1MASSEXCL': {1: 4, 2: 3, 3: 2, 4: 1},
}

ability = [i[0] for i in ability_self_concept]
ability_df = completed_dataset[ability].dropna()
ability_df = ability_df.replace(reverse_code)
ability_var = (ability_df.sum(axis=1)/len(ability)).to_frame()
ability_var.columns = ['ability']
ability_var.describe()

Unnamed: 0,ability
count,14575.0
mean,2.975077
std,0.643115
min,1.0
25%,2.75
50%,3.0
75%,3.5
max,4.0
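
Reverse-coding a 1-4 scale with a lookup table is equivalent to subtracting each response from the sum of the scale endpoints. The sketch below checks this on toy values; `SCALE_MIN` and `SCALE_MAX` are names introduced here for illustration.

```python
import pandas as pd

SCALE_MIN, SCALE_MAX = 1, 4  # original coding: 1 = Strongly agree ... 4 = Strongly disagree

responses = pd.Series([1, 2, 3, 4, 2])  # toy values

by_dict = responses.map({1: 4, 2: 3, 3: 2, 4: 1})  # explicit lookup
by_formula = (SCALE_MIN + SCALE_MAX) - responses   # arithmetic equivalent

print(by_formula.tolist())  # [4, 3, 2, 1, 3]
```

The arithmetic form also handles any non-integer values, which a dictionary lookup would leave unmatched.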


In [16]:
parent = [i[0] for i in parental_support]
parent_df = completed_dataset[parent]
parent_df

parental_var = (parent_df.sum(axis=1)/len(parent)).to_frame()
parental_var.columns = ['parents']
parental_var.describe()

Unnamed: 0,parents
count,14575.0
mean,0.613914
std,0.236105
min,0.0
25%,0.5
50%,0.6
75%,0.8
max,1.0


In [17]:
level = [i[0] for i in highest_level_math]
level_var = completed_dataset[level]
level_var.columns = ['base_level']
level_var.describe()

Unnamed: 0,base_level
count,14575.0
mean,4.36789
std,1.45867
min,0.0
25%,4.0
50%,4.0
75%,5.0
max,11.0


In [18]:
acheive = [i[0] for i in math_acheivement_score]
# rename() returns a new DataFrame, which avoids the SettingWithCopyWarning raised by assigning into a slice
acheive_var = completed_dataset[acheive].rename(columns={'X2TXMSCR': 'math'})
acheive_var.describe()

Unnamed: 0,math
count,14575.0
mean,69.946709
std,18.729319
min,25.0057
25%,58.91525
50%,69.4297
75%,85.15615
max,115.1


In [19]:
base = [i[0] for i in base_year_score]
# rename() returns a new DataFrame, which avoids the SettingWithCopyWarning raised by assigning into a slice
base_var = completed_dataset[base].rename(columns={'X1TXMSCR': 'base_math'})
base_var.describe()


Unnamed: 0,base_math
count,14575.0
mean,42.115971
std,11.610209
min,15.8641
25%,34.50865
50%,42.2088
75%,50.709
max,69.9317


In [20]:
completed_dataset['SES'].describe()

count    14575.000000
mean         0.148680
std          0.789495
min         -1.906800
25%         -0.427050
50%          0.092500
75%          0.676300
max          2.978300
Name: SES, dtype: float64

In [21]:
full_df = pd.concat([acheive_var, 
                     teacher_var, 
                     ability_var, 
                     parental_var, 
                     completed_dataset['sex'], 
                     completed_dataset['SES'], 
                     base_var,
                     level_var], axis=1)

In [44]:
# SES test - shift to non neg
full_df['SES'] = full_df['SES'] + abs(min(full_df['SES']))
min(full_df['SES'])

0.0

In [56]:
math = pd.qcut(full_df['math'], q=10)
math = math.apply(lambda row : row.mid).astype(int)
math.astype(int).to_frame()
math.unique()

array([104,  71,  76,  58,  83,  48,  91,  63,  67,  33])
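
The decile step above replaces each score with the midpoint of its quantile bin. A minimal sketch of the same idea on toy scores, using quartiles instead of deciles, looks like this; the variable names are hypothetical.

```python
import pandas as pd

scores = pd.Series([25.0, 40.0, 55.0, 60.0, 70.0, 85.0, 100.0, 115.0])  # toy scores

bins = pd.qcut(scores, q=4)                # categorical of quantile intervals
mids = bins.cat.categories.mid.to_numpy()  # one midpoint per interval
binned = pd.Series(mids[bins.cat.codes.to_numpy()], index=scores.index)

print(binned.nunique())  # one distinct value per bin
```

Looking up midpoints through the category codes keeps the result a plain numeric Series instead of a categorical one.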

In [43]:
full_df.corr()
# YAY! NOTE: probably using the wrong SES - try and figure out which one they actually used...

Unnamed: 0,math,teacher,ability,parents,sex,SES,base_math,base_level
math,1.0,-0.119108,0.299503,0.14018,-0.022433,0.414977,0.745042,0.358017
teacher,-0.119108,1.0,-0.136804,-0.02966,0.033255,-0.049281,-0.083906,-0.043894
ability,0.299503,-0.136804,1.0,0.082437,-0.106218,0.13143,0.301742,0.138489
parents,0.14018,-0.02966,0.082437,1.0,-0.063856,0.220776,0.129496,0.088178
sex,-0.022433,0.033255,-0.106218,-0.063856,1.0,-0.004499,-0.018885,0.023301
SES,0.414977,-0.049281,0.13143,0.220776,-0.004499,1.0,0.408821,0.224066
base_math,0.745042,-0.083906,0.301742,0.129496,-0.018885,0.408821,1.0,0.380554
base_level,0.358017,-0.043894,0.138489,0.088178,0.023301,0.224066,0.380554,1.0


In [23]:
RACE_MAP = {
    0: "White",
    1: "Black",
    2: "Hispanic",
    3: "Asian",
    4: "Other",
    5: "Other"
}

SEX_MAP = {
    0: "Male",
    1: "Female"
}

In [24]:
full_df_regression = full_df.copy()
full_df_regression['race'] = completed_dataset['race']
full_df_regression = full_df_regression.replace({'sex': SEX_MAP,
                                                 'race': RACE_MAP})

In [25]:
full_df_regression

Unnamed: 0,math,teacher,ability,parents,sex,SES,base_math,base_level,race
0,99.1403,2.2,3.50,0.6,Male,1.6907,50.4919,6.0,White
1,72.4904,2.6,3.25,0.0,Female,-0.3923,35.8045,2.0,White
2,75.4243,1.6,3.25,0.8,Female,1.1271,56.0477,5.0,Black
6,60.0290,3.0,3.00,0.9,Female,-0.4774,38.4063,3.0,White
9,86.8286,2.4,4.00,0.6,Female,0.1081,55.5463,4.0,White
...,...,...,...,...,...,...,...,...,...
23497,85.8476,1.4,3.00,0.9,Male,1.2713,51.3889,4.0,White
23499,60.8373,2.0,3.50,0.9,Female,-1.3350,33.3823,3.0,Hispanic
23500,60.9564,1.0,2.75,0.6,Female,-0.0031,38.8000,4.0,White
23501,65.1187,3.4,3.00,0.7,Male,0.7236,52.7281,4.0,White


In [35]:
new_weights = np.array(weights.array)/10
new_weights

array([47.0250141, 22.4455466, 18.5301339, ...,  9.8823515, 26.240286 ,
       16.3932794])

In [38]:
from statsmodels.regression.linear_model import WLS

model_lts = WLS.from_formula(
    'math ~ teacher + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression)
regression_lts = model_lts.fit(method='pinv')

In [39]:
regression_lts.summary2()

0,1,2,3
Model:,WLS,Adj. R-squared:,0.581
Dependent Variable:,math,AIC:,114116.5687
Date:,2022-08-11 08:34,BIC:,114192.4393
No. Observations:,14575,Log-Likelihood:,-57048.0
Df Model:,9,F-statistic:,2244.0
Df Residuals:,14565,Prob (F-statistic):,0.0
R-squared:,0.581,Scale:,147.09

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,25.7019,0.6636,38.7337,0.0000,24.4013,27.0026
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.3701,0.2012,-1.8399,0.0658,-0.7645,0.0242
"C(race, Treatment(reference=""White""))[T.Asian]",3.3572,0.3770,8.9053,0.0000,2.6182,4.0961
"C(race, Treatment(reference=""White""))[T.Black]",-1.6682,0.3606,-4.6268,0.0000,-2.3749,-0.9615
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.0933,0.3008,-0.3102,0.7564,-0.6829,0.4963
"C(race, Treatment(reference=""White""))[T.Other]",-0.0393,0.3448,-0.1139,0.9093,-0.7152,0.6367
teacher,-1.9070,0.1924,-9.9134,0.0000,-2.2841,-1.5299
SES,2.9115,0.1434,20.3046,0.0000,2.6304,3.1926
base_math,1.0440,0.0103,101.7891,0.0000,1.0239,1.0641

0,1,2,3
Omnibus:,398.472,Durbin-Watson:,1.997
Prob(Omnibus):,0.0,Jarque-Bera (JB):,523.11
Skew:,-0.324,Prob(JB):,0.0
Kurtosis:,3.665,Condition No.:,300.0


In [40]:
model_sc = WLS.from_formula(
    'math ~ ability + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_sc = model_sc.fit(method='pinv')
regression_sc.summary2()



0,1,2,3
Model:,WLS,Adj. R-squared:,0.584
Dependent Variable:,math,AIC:,114011.9676
Date:,2022-08-11 08:35,BIC:,114087.8382
No. Observations:,14575,Log-Likelihood:,-56996.0
Df Model:,9,F-statistic:,2271.0
Df Residuals:,14565,Prob (F-statistic):,0.0
R-squared:,0.584,Scale:,146.04

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,15.8259,0.6035,26.2227,0.0000,14.6429,17.0088
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.1280,0.2015,-0.6350,0.5254,-0.5229,0.2670
"C(race, Treatment(reference=""White""))[T.Asian]",3.3090,0.3756,8.8093,0.0000,2.5727,4.0453
"C(race, Treatment(reference=""White""))[T.Black]",-2.0540,0.3609,-5.6916,0.0000,-2.7613,-1.3466
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.1471,0.2998,-0.4906,0.6237,-0.7346,0.4405
"C(race, Treatment(reference=""White""))[T.Other]",-0.1249,0.3437,-0.3633,0.7164,-0.7985,0.5488
ability,2.3584,0.1652,14.2790,0.0000,2.0347,2.6822
SES,2.9105,0.1429,20.3728,0.0000,2.6305,3.1906
base_math,1.0115,0.0106,95.8235,0.0000,0.9908,1.0322

0,1,2,3
Omnibus:,391.23,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,517.073
Skew:,-0.318,Prob(JB):,0.0
Kurtosis:,3.669,Condition No.:,273.0
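
A note on weighting: statsmodels documents `weights` as the weighting argument of `WLS`, while `freq_weights` is a `GLM` argument, so it is worth checking whether the `freq_weights=` keyword used in these cells has any effect on the fit. The NumPy sketch below is not the statsmodels implementation; it only illustrates what observation weights do to a least-squares estimate, with toy data and a hypothetical helper name.

```python
import numpy as np


def weighted_least_squares(X, y, w):
    """Minimise sum_i w_i * (y_i - x_i @ beta)**2 via the square-root-weight transform."""
    sw = np.sqrt(np.asarray(w, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # intercept plus one predictor

# Two subgroups with different slopes, so that the weighting matters.
group = np.arange(n) < n // 2
y = np.where(group, 1.0 * x, 3.0 * x) + rng.normal(scale=0.1, size=n)
w = np.where(group, 10.0, 1.0)        # toy survey weights

beta_unweighted = weighted_least_squares(X, y, np.ones(n))
beta_weighted = weighted_least_squares(X, y, w)
print(beta_unweighted[1], beta_weighted[1])
```

With the first subgroup weighted ten times more heavily, the weighted slope moves toward that subgroup's slope. If the weighted and unweighted fits in the notebook are identical, that would suggest the weights are not being applied.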


In [41]:
model_sc = WLS.from_formula(
    'math ~ (ability * teacher * parents) + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_sc = model_sc.fit(method='pinv')
regression_sc.summary2()



0,1,2,3
Model:,WLS,Adj. R-squared:,0.586
Dependent Variable:,math,AIC:,113925.6793
Date:,2022-08-11 08:35,BIC:,114047.0723
No. Observations:,14575,Log-Likelihood:,-56947.0
Df Model:,15,F-statistic:,1378.0
Df Residuals:,14559,Prob (F-statistic):,0.0
R-squared:,0.587,Scale:,145.12

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,26.0742,5.4636,4.7723,0.0000,15.3648,36.7836
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.0349,0.2013,-0.1731,0.8626,-0.4295,0.3598
"C(race, Treatment(reference=""White""))[T.Asian]",3.3065,0.3757,8.8012,0.0000,2.5701,4.0429
"C(race, Treatment(reference=""White""))[T.Black]",-2.1416,0.3599,-5.9503,0.0000,-2.8471,-1.4361
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.2085,0.2989,-0.6974,0.4856,-0.7945,0.3775
"C(race, Treatment(reference=""White""))[T.Other]",-0.1226,0.3426,-0.3577,0.7206,-0.7941,0.5490
ability,-0.2616,1.8051,-0.1449,0.8848,-3.7999,3.2767
teacher,-3.4452,2.2957,-1.5007,0.1335,-7.9451,1.0547
ability:teacher,0.6684,0.7698,0.8682,0.3853,-0.8405,2.1773

0,1,2,3
Omnibus:,383.612,Durbin-Watson:,1.995
Prob(Omnibus):,0.0,Jarque-Bera (JB):,504.085
Skew:,-0.315,Prob(JB):,0.0
Kurtosis:,3.658,Condition No.:,5012.0


In [30]:
table_3 = {}
model_lts = WLS.from_formula(
    'math ~ teacher + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_lts = model_lts.fit(method='pinv')
table_3['model_1'] = regression_lts.summary2()

model_sc = WLS.from_formula(
    'math ~ ability + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_sc = model_sc.fit(method='pinv')
table_3['model_2'] = regression_sc.summary2()

model_sc_lts = WLS.from_formula(
    'math ~ (ability * teacher) + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_sc_lts = model_sc_lts.fit(method='pinv')
table_3['model_3'] = regression_sc_lts.summary2()

# Model 4: ability self-concept x parental support (interaction inferred from the name model_ps_sc)
model_ps_sc = WLS.from_formula(
    'math ~ (ability * parents) + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_ps_sc = model_ps_sc.fit(method='pinv')
table_3['model_4'] = regression_ps_sc.summary2()

model_all = WLS.from_formula(
    'math ~ (ability * teacher * parents) + C(sex, Treatment(reference="Male")) + C(race, Treatment(reference="White")) + SES + base_math + base_level',
    data=full_df_regression,
    freq_weights=np.array(weights.array))
regression_all = model_all.fit(method='pinv')
regression_all.summary2()



0,1,2,3
Model:,WLS,Adj. R-squared:,0.586
Dependent Variable:,math,AIC:,113925.6793
Date:,2022-08-11 08:33,BIC:,114047.0723
No. Observations:,14575,Log-Likelihood:,-56947.0
Df Model:,15,F-statistic:,1378.0
Df Residuals:,14559,Prob (F-statistic):,0.0
R-squared:,0.587,Scale:,145.12

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,26.0742,5.4636,4.7723,0.0000,15.3648,36.7836
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.0349,0.2013,-0.1731,0.8626,-0.4295,0.3598
"C(race, Treatment(reference=""White""))[T.Asian]",3.3065,0.3757,8.8012,0.0000,2.5701,4.0429
"C(race, Treatment(reference=""White""))[T.Black]",-2.1416,0.3599,-5.9503,0.0000,-2.8471,-1.4361
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.2085,0.2989,-0.6974,0.4856,-0.7945,0.3775
"C(race, Treatment(reference=""White""))[T.Other]",-0.1226,0.3426,-0.3577,0.7206,-0.7941,0.5490
ability,-0.2616,1.8051,-0.1449,0.8848,-3.7999,3.2767
teacher,-3.4452,2.2957,-1.5007,0.1335,-7.9451,1.0547
ability:teacher,0.6684,0.7698,0.8682,0.3853,-0.8405,2.1773

0,1,2,3
Omnibus:,383.612,Durbin-Watson:,1.995
Prob(Omnibus):,0.0,Jarque-Bera (JB):,504.085
Skew:,-0.315,Prob(JB):,0.0
Kurtosis:,3.658,Condition No.:,5012.0


In [31]:
regression_all.get_robustcov_results().summary()

0,1,2,3
Dep. Variable:,math,R-squared:,0.587
Model:,WLS,Adj. R-squared:,0.586
Method:,Least Squares,F-statistic:,1645.0
Date:,"Thu, 11 Aug 2022",Prob (F-statistic):,0.0
Time:,08:33:08,Log-Likelihood:,-56947.0
No. Observations:,14575,AIC:,113900.0
Df Residuals:,14559,BIC:,114000.0
Df Model:,15,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,26.0742,6.701,3.891,0.000,12.939,39.209
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.0349,0.202,-0.173,0.863,-0.430,0.361
"C(race, Treatment(reference=""White""))[T.Asian]",3.3065,0.362,9.145,0.000,2.598,4.015
"C(race, Treatment(reference=""White""))[T.Black]",-2.1416,0.368,-5.816,0.000,-2.863,-1.420
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.2085,0.309,-0.675,0.499,-0.814,0.397
"C(race, Treatment(reference=""White""))[T.Other]",-0.1226,0.338,-0.362,0.717,-0.786,0.540
ability,-0.2616,2.166,-0.121,0.904,-4.508,3.985
teacher,-3.4452,2.788,-1.236,0.217,-8.910,2.019
ability:teacher,0.6684,0.918,0.728,0.467,-1.132,2.468

0,1,2,3
Omnibus:,383.612,Durbin-Watson:,1.995
Prob(Omnibus):,0.0,Jarque-Bera (JB):,504.085
Skew:,-0.315,Prob(JB):,3.46e-110
Kurtosis:,3.658,Cond. No.,5010.0


In [32]:
cov = regression_all.cov_params()
#ind_col = list(cov.columns).index('ability:teacher:parents')

In [33]:
regression_all.cov_params()

Unnamed: 0,Intercept,"C(sex, Treatment(reference=""Male""))[T.Female]","C(race, Treatment(reference=""White""))[T.Asian]","C(race, Treatment(reference=""White""))[T.Black]","C(race, Treatment(reference=""White""))[T.Hispanic]","C(race, Treatment(reference=""White""))[T.Other]",ability,teacher,ability:teacher,parents,ability:parents,teacher:parents,ability:teacher:parents,SES,base_math,base_level
Intercept,29.851181,-0.032089,-0.022863,-0.034085,-0.040197,-0.01941,-9.56704,-12.158622,3.938922,-42.263329,13.576525,17.343615,-5.597025,0.02262,-0.002638,-0.010797
"C(sex, Treatment(reference=""Male""))[T.Female]",-0.032089,0.040538,-0.000421,-0.000418,-0.000517,-0.000188,0.004717,0.000645,-0.000628,-0.00624,0.003268,0.004865,-0.001382,-0.000423,-4e-06,-0.000551
"C(race, Treatment(reference=""White""))[T.Asian]",-0.022863,-0.000421,0.141141,0.014105,0.01523,0.016583,0.007554,0.013381,-0.004268,0.059884,-0.017594,-0.021463,0.007944,-0.000638,-0.000484,-0.00118
"C(race, Treatment(reference=""White""))[T.Black]",-0.034085,-0.000418,0.014105,0.129538,0.021438,0.018856,-0.003767,0.006208,-0.001621,0.005066,-0.002647,-0.006029,0.002316,0.00307,0.000581,-9.8e-05
"C(race, Treatment(reference=""White""))[T.Hispanic]",-0.040197,-0.000517,0.01523,0.021438,0.08937,0.019304,0.005272,0.007322,-0.002662,0.018936,-0.00848,-0.006587,0.003268,0.009148,0.00014,-5.3e-05
"C(race, Treatment(reference=""White""))[T.Other]",-0.01941,-0.000188,0.016583,0.018856,0.019304,0.117387,-0.002028,0.000411,0.000382,-0.006119,0.003739,0.002748,-0.001656,0.00292,3.4e-05,0.000516
ability,-9.56704,0.004717,0.007554,-0.003767,0.005272,-0.002028,3.258551,3.939868,-1.348696,13.570052,-4.595729,-5.59923,1.908212,0.000513,-0.000327,-0.000428
teacher,-12.158622,0.000645,0.013381,0.006208,0.007322,0.000411,3.939868,5.27034,-1.717874,17.351469,-5.602611,-7.543354,2.448937,4.7e-05,6.6e-05,-1.3e-05
ability:teacher,3.938922,-0.000628,-0.004268,-0.001621,-0.002662,0.000382,-1.348696,-1.717874,0.592587,-5.594988,1.908648,2.447936,-0.840728,0.000107,-1.2e-05,-0.000201
parents,-42.263329,-0.00624,0.059884,0.005066,0.018936,-0.006119,13.570052,17.351469,-5.594988,69.832046,-22.263502,-28.749493,9.208045,-0.006149,0.000467,-0.008714


In [34]:
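# The interaction term is named 'ability:parents' (see the cov_params() columns above),
# so looking it up as 'parents:ability' raises a KeyError.
cov.loc['ability:parents', 'parents']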