In [3]:
import sys
sys.path.append('../')

# Generate Individual Focal Students

The individual focal students for each block group are going to be generated at random. We are going to use the block groupes defined by SFUSD. The available data allows us to get rough estimates of the following probabilities for each block group:

- Probability a student is AALPI: $P(AALPI)$
- Probability a student is FRL: $P(FRL)$
- Probability a student is both AALPI and FRL: $P(AALPI \cap FRL)$
- Probability a student is FRL conditional on him being AALPI: $$P(FRL \mid AALPI) = \frac{P(AALPI \cap FRL)}{P(AALPI)}$$
- Probability a student is FRL conditional on him not being AALPI: $$P(FRL \mid \overline{AALPI}) = \frac{P(FRL)-P(AALPI \cap FRL)}{1-P(AALPI)}$$

Since some blocks have a small number of students we are going to use the mean of the posterior probability of a Bernoulli distribution with a uniform prior as probability estimates. Let $\theta$ be the parameter of the Bernoulli disttribution, $m$ the counts of interest and $n$ the total counts, we can get our probability estimates as: 

$$\hat{\theta} = \mathbb{E}\left[\theta\mid m,n\right]=\frac{m+1}{n+2}$$

For example, we can estimate the probability of a student being AALPI as

$$P(AALPI)=\frac{\textit{counts of AALPI}+1}{\textit{total counts}+2}$$

# Setting up the tiebreaker for the simulator engine using the class SimulationPreprocessing

To run the counterfactual simulation, we first need to add the new tiebreakers to the student data. Under the current setup of the dssg_sfusd repository (this repository) and the sfusd-project repository, we have to create and maintain a separate version of the student data that includes a column for each tiebreaker we wish to simulate.

The first step for this is to update (or create if it doesn't exist) a version of the student data that has a binary column indicating if a student has the tiebreaker. We can do this by using the SimulationPreprocessing in three steps:

1. compute the new columns
2. update the student data
3. save the updated student data

## Step 1: Compute new columns

For this step, we initialize the classifier or tiebreaker of interest and add it as a column in the student data. The student data for a particular period is loaded by initializing the `SimulationPreprocessing`:

In [None]:
from src.d02_intermediate.simulation_preprocessing import SimulationPreprocessing
frl_key = 'tk5'
sp = SimulationPreprocessing(frl_key=frl_key, period="1819")

### Examples of models that have been loaded

In [None]:
from src.d04_modeling.ctip_classifier import CtipClassifier

sp.add_frl_labels()

tiebreaker = 'ctip1'
fpr = 0.04
model = CtipClassifier(positive_group='nBoth', frl_key=frl_key)

sp.add_equity_tiebreaker(model, params=fpr, tiebreaker=tiebreaker)
print(model.get_roc())

In [None]:
from src.d04_modeling.naive_classifier import NaiveClassifier

sp.add_frl_labels()

tiebreaker = 'special014'
fpr = 0.14
model = NaiveClassifier(positive_group='nAAFRL', frl_key=frl_key, proportion=True)

sp.add_equity_tiebreaker(model, params=fpr, tiebreaker=tiebreaker)

In [None]:
from src.d04_modeling.naive_classifier import NaiveClassifier

sp.add_frl_labels()

tiebreaker = 'naive004'
fpr = 0.04
model = NaiveClassifier(positive_group='nBoth', frl_key=frl_key, proportion=True)

sp.add_equity_tiebreaker(model, params=fpr, tiebreaker=tiebreaker)

In [None]:
from src.d04_modeling.naive_classifier import NaiveClassifier

sp.add_frl_labels()

tiebreaker = 'naive016'
fpr = 0.16
model = NaiveClassifier(positive_group='nBoth', frl_key=frl_key, proportion=True)

sp.add_equity_tiebreaker(model, params=fpr, tiebreaker=tiebreaker)

In [None]:
from src.d04_modeling.knapsack_classifier import KnapsackClassifier

sp.add_frl_labels()

tiebreaker = 'knapsack008'
fpr = 0.08
positive_group = 'nFocal'
model = KnapsackClassifier(positive_group=positive_group, load=True,
                           frl_key=frl_key, run_name="%s_%s.pkl" % (frl_key, positive_group))

sp.add_equity_tiebreaker(model, params=fpr, tiebreaker=tiebreaker)

In [None]:
from src.d04_modeling.knapsack_classifier import KnapsackClassifier


sp.add_frl_labels()

tiebreaker = 'knapsack014'
fpr = 0.14
positive_group = 'nBoth'
model = KnapsackClassifier(positive_group=positive_group, load=True,
                           frl_key=frl_key, run_name="%s_%s.pkl" % (frl_key, positive_group))

sp.add_equity_tiebreaker(model, params=fpr, tiebreaker=tiebreaker)

In [None]:
from src.d04_modeling.propositional_classifier import andClassifier

sp.add_frl_labels()

tiebreaker = 'pc1020_050'
params = [0.2, 0.5]
pc1 = andClassifier(["pctFocal", "BG_pctFocal"], group_criterion="nbhd", frl_key=frl_key)

sp.add_equity_tiebreaker(pc1, params=params, tiebreaker=tiebreaker)

In [None]:
from src.d04_modeling.propositional_classifier import andClassifier

sp.add_frl_labels()

tiebreaker = 'pc2025_040'
params = [0.25, 0.4]
positive_group = 'nBoth'
pc2 = andClassifier(["pctBoth", "BG_pctBoth"], positive_group=positive_group, group_criterion="nbhd", frl_key=frl_key)

sp.add_equity_tiebreaker(pc2, params=params, tiebreaker=tiebreaker)
print(pc2.get_roc([params]))

In [None]:
from src.d04_modeling.propositional_classifier import andClassifier, orClassifier

sp.add_frl_labels()

tiebreaker = 'pc3_035'
params = 0.30
positive_group = 'nBoth'

eligibility_classifier = orClassifier(["Housing", "Redline"], binary_var=[0,1])
pc3 = andClassifier(["pctBoth"], positive_group=positive_group, eligibility_classifier=eligibility_classifier, frl_key=frl_key)

sp.add_equity_tiebreaker(pc3, params=params, tiebreaker=tiebreaker)
print(pc3.get_roc([params]))
pc3.plot_map(params=params)

In [4]:
from src.d04_modeling.ctip_classifier import CtipClassifier

sp.add_frl_labels()

tiebreaker = 'ctip1'
fpr = 0.04
model = CtipClassifier(positive_group='nAAFRL', frl_key=frl_key)

sp.add_equity_tiebreaker(model, params=fpr, tiebreaker=tiebreaker)
print(model.get_roc())

Adding African-American counts to FRL data...
Loading Block FRL data...0.1855
Loading Block Demographic data...0.1695
Loading Student Demographic data...4.1036
Int64Index([60750179021025, 60750179021001, 60750179021027, 60750179021035,
            60750179021026, 60750179021034, 60750179021010, 60750179021017,
            60750179021037, 60750179021031,
            ...
            60750605023000, 60750264041003, 60750264041004, 60750605022001,
            60750605023005, 60750605023004, 60750605023001, 60750605023003,
            60750605023002, 60750605023007],
           dtype='int64', name='geoid', length=494)
Ratio of students recieving the equity tiebreaker: 0.16
        fpr       tpr
0  0.225321  0.693583


In [29]:
from src.d04_modeling.propositional_classifier import andClassifier, orClassifier

sp.add_frl_labels()

tiebreaker = 'pc4_012'
params = 0.12
positive_group = 'nAAFRL'

eligibility_classifier = orClassifier(["Housing", "Redline"], binary_var=[0,1])
pc4 = andClassifier(["pctAAFRL"], positive_group=positive_group, eligibility_classifier=eligibility_classifier, frl_key=frl_key)

sp.add_equity_tiebreaker(pc4, params=params, tiebreaker=tiebreaker)
print(pc3.get_roc([params]))

Int64Index([60750101002006, 60750101002007, 60750107003004, 60750614002007,
            60750614002010, 60750614002011, 60750614001000, 60750614001018,
            60750614001017, 60750229022005,
            ...
            60750312022019, 60750605023001, 60750314002003, 60750314005004,
            60750313021000, 60750314002007, 60750314004002, 60750313021007,
            60750313021005, 60750354001005],
           dtype='int64', name='geoid', length=144)
Ratio of students recieving the equity tiebreaker: 0.07
        tpr       fpr
0  0.703847  0.092585


## Step 2: Update student data

Once we have setup the new column for our student data we can update the student data that is going to be used for the simulation: `student_out`. The following method loads the student data used for the simulation, checks that it is consistent with the student data that we are going to use to update the tiebreaker column and then generate a new student data dataframe that can be used to overwrite the old one.

__Note:__ The method `update_student_data` only updates the `student_out` dataframe if it doesn't already have the tiebreaker. If we wish to add it again and overwrite the previous column we can use the method `sp.set_recalculate(True)`

In [None]:
# sp.set_recalculate(True)

In [30]:
student_out = sp.update_student_data(tiebreaker)

Loading student data from:
 /share/data/school_choice_equity/simulator_data/student/drop_optout_1819.csv
Updating pc4_012 in student data...


## Step 3: Save student data

Finally, once we have an updated version of the student data `student_out` we can export it to the corresponding directory. This updated student data is saved in the `/share/data/school_choice_equity/simulator_data/student/` directory, which is different from the original directory with the "raw" data.

In [31]:
sp.save_student_data(student_out)

Saving to:
  /share/data/school_choice_equity/simulator_data/student/drop_optout_1819.csv
