# Overview
This script generates the background vocabulary for our pilot simulations, which will be shared at SSSR in Copenhagen in July 2024. To do this, we first read in the My Sidewalks data and count the total number of words it contains, then sample from other relevant language sources to generate the background vocabularies for training. The script also generates the test sets for this training condition. See each section below for more information.

In [1]:
import pandas as pd
import random

import nltk
cmuduct = nltk.corpus.cmudict.dict()
random.seed(765)

In [2]:
# Store the CMUdict entries as a set so the membership checks below are fast.
cmu = {word.lower() for word in cmuduct if word.isalpha()}
acbc = pd.read_csv('data/vocabulary/tidycorpus.csv')['token'].to_list()

acbc = [word.lower() for word in acbc if not isinstance(word, float)]
acbc = [word.lower() for word in acbc if word.isalpha()]




In [3]:
programs = pd.read_csv('data/combined_programs.csv')
my_sidewalks = []
trade_books = []

for i, row in programs.iterrows():
    if "Sidewalks" in row['program_name']:
        if isinstance(row.word_raw, str):
            my_sidewalks.append(row.word_raw.lower())
    if "LLI" in row["program_name"]:
        if isinstance(row.word_raw, str):
            trade_books.append(row.word_raw.lower())
            
            
my_sidewalks = [word for word in my_sidewalks if word in cmu]

In [4]:
trade_books = [word.lower() for word in trade_books if isinstance(word, str) and word.isalpha()]
trade_books = [word for word in trade_books if word in cmu]

My Sidewalks has this many total words:

In [5]:
N = len(my_sidewalks)
print(N)

10473


And this many unique words:

In [6]:
print(len(set(my_sidewalks)))

1282


The goal of constructing the background vocabulary is to identify additional sets of (total) words (i.e., tokens, not types) that comprise a certain proportion of the overall training environment. We will implement this such that we have conditions where the program words represent 25%, 50%, 75%, and 100% of the overall training environment.
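
As a quick check on the arithmetic, the sketch below computes how many program and background tokens each condition would contain. It mirrors the `round(proportion*N)` pattern used in the cells further down; the helper name `split_counts` is hypothetical, and `N = 10473` is the token count printed above.

```python
def split_counts(n_total, proportion):
    """Return (program_tokens, background_tokens) for one condition."""
    program_n = round(proportion * n_total)  # same rounding as the cells below
    background_n = n_total - program_n
    return program_n, background_n


N = 10473  # total My Sidewalks tokens printed above

for proportion in (1, 0.75, 0.5, 0.25):
    program_n, background_n = split_counts(N, proportion)
    print(proportion, program_n, background_n)
```

Note that Python's `round` sends exact halves to the nearest even integer, so the 50% condition splits into 5236 program tokens and 5237 background tokens rather than the other way around.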

In [7]:
program_details = []

for i, row in programs.iterrows():
    if "Sidewalks" in row['program_name']:
        program_details.append(row.program_name)

In [8]:
set(program_details)

{'My Sidewalks K',
 'My Sidewalks Level A Unit 1',
 'My Sidewalks Level A Unit 2',
 'My Sidewalks Level A Unit 3',
 'My Sidewalks Level A Unit 4',
 'My Sidewalks Level A Unit 5'}

We will set the overall vocabulary size to be approximately 10K total words. The exact value is `N`, the number of words obtained by reading all the My Sidewalks books one time (10,473 total words in the filtered list above). This is a realistic and tractable number of words to use as training examples. In the condition where the decodable texts are 100% of the training set, the My Sidewalks books will all be read once through. In the condition where the decodable texts represent 75% of the overall vocabulary, the remaining 25% of the words will be sampled from children's sources. The 50% and 25% conditions are built the same way.

For each level of `proportion` we will take 20 random draws for the background vocabulary. This will allow us to look at 20 different models for each proportion in each condition. Importantly, a given draw of background vocabulary for a particular proportion will be used for both the My Sidewalks (decodable) model and the trade books model. For example, for sample #1 of the 25% background vocabulary (where 75% of the words are from the program and 25% are from the background), the same set of background words will be used for both the decodable and trade book models. Sample #2 will then be a different draw of the 25% of words making up the background vocabulary, and so on.
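
The pairing described above can be sketched as follows. This is a minimal illustration on toy data, not the code used in the cells below: the function name `paired_training_sets`, its `seed` parameter, and the toy word lists are all assumptions made for the example.

```python
import random


def paired_training_sets(decodable, trade, background_pool, proportion, n_samples, seed=0):
    """Build {sample_id: (decodable_set, trade_set)} sharing one background draw per sample."""
    rng = random.Random(seed)  # local generator; the notebook itself seeds the global `random` module
    n_total = len(decodable)
    program_n = round(proportion * n_total)
    background_n = n_total - program_n
    paired = {}
    for sample_id in range(n_samples):
        # One draw (with replacement) per sample ID, reused for both programs.
        background = [rng.choice(background_pool) for _ in range(background_n)]
        paired[sample_id] = (
            decodable[:program_n] + background,
            trade[:program_n] + background,
        )
    return paired


# Toy data, not the real corpora.
decodable = ["cat", "sat", "mat", "hat"]
trade = ["river", "forest", "morning", "window", "garden"]
pool = ["dog", "sun", "tree"]

paired = paired_training_sets(decodable, trade, pool, proportion=0.75, n_samples=2)
```

Within each sample ID the two training sets end with the same background words, so any difference between the paired models can be attributed to the program words rather than to the background draw.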

# Data for our two primary conditions: my_sidewalks (decodable) and trade_books (LLI)

Each proportion is associated with a dictionary that holds the background samples for that proportion. The key is the sample ID and the value is the list of background words drawn for that sample. Each sample is then concatenated with the program words before writing to file.

In [9]:
samples_n = 20

background_75 = {}
background_50 = {}
background_25 = {}
background_100 = {}

## Programs 100% Background 0%
We have a focal condition ("my sidewalks"/decodable) and a comparison condition ("trade books"). The comparison condition will be based on the LLI books (we think of them as "trade books" for our purposes). We will draw runs of LLI text comprising 10,473 words (i.e., `N`, the number of total words in My Sidewalks printed above).

The My Sidewalks set is the 100% set for that program, so we don't have to sample it. Below is the sample of LLI books for this proportion. Because the proportion is 1, there is no background vocabulary to resample in this condition. Instead, the 20 LLI samples are contiguous runs of `N` words, each starting at a randomly chosen position in the LLI word list.

In [10]:
proportion = 1
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N

In [11]:
starting_points = [random.choice(range(len(trade_books) - N)) for _ in range(samples_n)]

In [12]:
trade_books_100 = {}

for i in range(samples_n):
    ending_point = starting_points[i] + treatment_n
    trade_books_100[i] = trade_books[starting_points[i]:ending_point]
    assert len(trade_books_100[i]) == N
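
The window sampling above can also be written as a small reusable function. The sketch below is one possible formulation under stated assumptions: the name `sample_windows` and its `seed` parameter are hypothetical, it uses a local `random.Random` generator rather than the globally seeded one, and it allows the final valid start position, which `range(len(trade_books) - N)` in the cell above excludes. It therefore will not reproduce the exact draws made in this notebook.

```python
import random


def sample_windows(tokens, window_n, n_samples, seed=0):
    """Return {sample_id: contiguous slice of window_n tokens} at random start positions."""
    if len(tokens) < window_n:
        raise ValueError("corpus has fewer tokens than the requested window")
    rng = random.Random(seed)
    last_start = len(tokens) - window_n  # latest start that still yields a full window
    windows = {}
    for sample_id in range(n_samples):
        start = rng.randrange(last_start + 1)
        windows[sample_id] = tokens[start:start + window_n]
    return windows


# Toy corpus of 50 distinct tokens, not the real LLI word list.
toy_tokens = ["w{}".format(i) for i in range(50)]
windows = sample_windows(toy_tokens, window_n=10, n_samples=3)
```

The explicit length check makes the failure mode clear when the comparison corpus is shorter than the requested window, a case in which the cell above would raise from `random.choice` on an empty range.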

In [13]:
OUTDIR = "data/SSSR2024/program_100_background_0/"

# my_sidewalks only has one draw in this condition because it is all of that program
outfile_my_sidewalks = OUTDIR + "my_sidewalks/" + "my_sidewalks_100_background_0" + ".csv"
with open(outfile_my_sidewalks, "w") as f:
    for word in my_sidewalks:
        f.write("{}\n".format(word))


for i in range(samples_n):
    outfile_trade_books = OUTDIR + "trade_books/" + str(i) + "/trade_books_100_background_0_" + "sample_" + str(i) + ".csv"
    with open(outfile_trade_books, "w") as f:
        for word in trade_books_100[i]:
            f.write("{}\n".format(word))
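
The write cells in this notebook assume that every output directory (for example, one subdirectory per sample ID) already exists, since `open(..., "w")` does not create missing directories. A minimal sketch of a helper that creates them on demand is shown below; `write_word_list` is a hypothetical name and is not used elsewhere in this script, and the demo writes to a temporary directory rather than to `data/SSSR2024/`.

```python
import tempfile
from pathlib import Path


def write_word_list(words, outfile):
    """Write one word per line, creating parent directories as needed."""
    path = Path(outfile)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        for word in words:
            f.write("{}\n".format(word))
    return path


# Demo in a throwaway directory so nothing in the project tree is touched.
demo_root = Path(tempfile.mkdtemp())
demo_file = write_word_list(["cat", "sat"], demo_root / "trade_books" / "0" / "demo.csv")
```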
        

## Programs 75% Background 25%

In [14]:
proportion = .75
treatment_n = round(proportion*N)
background_n = N-treatment_n

train_my_sidewalks = my_sidewalks[:treatment_n]
train_trade_books = trade_books[:treatment_n]


for i in range(samples_n):
    background_25[i] = [random.choice(acbc) for e in range(background_n)]

In [15]:
OUTDIR = "data/SSSR2024/program_75_background_25/"

for i in background_25.keys():
    outfile_my_sidewalks = OUTDIR + "my_sidewalks/" + str(i) + "/my_sidewalks_75_background_25_" + "sample_" + str(i) + ".csv"
    with open(outfile_my_sidewalks, "w") as f:
        for word in train_my_sidewalks + background_25[i]:
            f.write("{}\n".format(word))
    outfile_trade_books = OUTDIR + "trade_books/" + str(i) + "/trade_books_75_background_25_" + "sample_" + str(i) + ".csv"
    with open(outfile_trade_books, "w") as f:
        for word in train_trade_books + background_25[i]:
            f.write("{}\n".format(word))
        

## Programs 50% Background 50%

In [16]:
proportion = .50
treatment_n = round(proportion*N)
background_n = N-treatment_n

train_my_sidewalks = my_sidewalks[:treatment_n]
train_trade_books = trade_books[:treatment_n]

for i in range(samples_n):
    background_50[i] = [random.choice(acbc) for e in range(background_n)]

In [17]:
OUTDIR = "data/SSSR2024/program_50_background_50/"

for i in background_50.keys():
    outfile_my_sidewalks = OUTDIR + "my_sidewalks/" + str(i) + "/my_sidewalks_50_background_50_" + "sample_" + str(i) + ".csv"
    with open(outfile_my_sidewalks, "w") as f:
        for word in train_my_sidewalks + background_50[i]:
            f.write("{}\n".format(word))
    outfile_trade_books = OUTDIR + "trade_books/" + str(i) + "/trade_books_50_background_50_" + "sample_" + str(i) + ".csv"
    with open(outfile_trade_books, "w") as f:
        for word in train_trade_books + background_50[i]:
            f.write("{}\n".format(word))
        

## Programs 25% Background 75%

In [18]:
proportion = .25
treatment_n = round(proportion*N)
background_n = N-treatment_n

train_my_sidewalks = my_sidewalks[:treatment_n]
train_trade_books = trade_books[:treatment_n]

for i in range(samples_n):
    background_75[i] = [random.choice(acbc) for e in range(background_n)]

In [19]:
OUTDIR = "data/SSSR2024/program_25_background_75/"

for i in background_75.keys():
    outfile_my_sidewalks = OUTDIR + "my_sidewalks/" + str(i) + "/my_sidewalks_25_background_75_" + "sample_" + str(i) + ".csv"
    with open(outfile_my_sidewalks, "w") as f:
        for word in train_my_sidewalks + background_75[i]:
            f.write("{}\n".format(word))
    outfile_trade_books = OUTDIR + "trade_books/" + str(i) + "/trade_books_25_background_75_" + "sample_" + str(i) + ".csv"
    with open(outfile_trade_books, "w") as f:
        for word in train_trade_books + background_75[i]:
            f.write("{}\n".format(word))
        

# Programs 0% Background 100%
This is a set of children's words for basic comparison. We will draw 20 samples of words at random from ACBC to serve as a comparison for all models. We have similar samples from the LLI/trade books group above, but this is the condition where all words come from the background vocabulary. There is only one set of 20 samples here, rather than one set per program, because we want to compare each program against the same background-only samples.

In [20]:
proportion = 0
treatment_n = round(proportion*N)
background_n = N-treatment_n

train_my_sidewalks = my_sidewalks[:treatment_n]
train_trade_books = trade_books[:treatment_n]

for i in range(samples_n):
    background_100[i] = [random.choice(acbc) for e in range(background_n)]

In [21]:
OUTDIR = "data/SSSR2024/program_0_background_100/"

for i in background_100.keys():
    outfile = OUTDIR + str(i) + "/my_sidewalks_0_background_100_" + "sample_" + str(i) + ".csv"
    with open(outfile, "w") as f:
        for word in background_100[i]:
            f.write("{}\n".format(word))