# Overview
This script generates the background vocabulary for our pilot simulations, which will be shared at SSSR in Copenhagen in July 2024. To do this, we first read in the My Sidewalks data and count the total number of words it contains, then sample from other relevant language sources to build the background vocabularies for training. The script also generates the test sets for this training condition. See each section below for more information.

In [1]:
import pandas as pd
import random
random.seed(765)

In [9]:

programs = pd.read_csv('data/combined_programs.csv')
my_sidewalks = []
trade_books = []

# Collect lower-cased word tokens per program, skipping non-string entries (e.g. NaN)
for i, row in programs.iterrows():
    if "Sidewalks" in row['program_name']:
        if isinstance(row.word_raw, str):
            my_sidewalks.append(row.word_raw.lower())
    # LLI texts stand in for trade books
    if "LLI" in row["program_name"]:
        if isinstance(row.word_raw, str):
            trade_books.append(row.word_raw.lower())

Read in the ACBC words and the CMU words.

In [70]:
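import csv

acbc = pd.read_csv('data/vocabulary/tidycorpus.csv')
# Keep alphabetic string tokens only; `and` short-circuits, so non-string
# values (e.g. NaN) never reach .isalpha()
acbc = [word.lower() for word in acbc.token.tolist() if isinstance(word, str) and word.isalpha()]

# Initialize the list before appending rows to it
cmu = []
with open('data/cmu_words.csv', 'r') as f:
    reader = csv.reader(f, delimiter=",")
    for row in reader:
        cmu.append(row)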

ParserError: Error tokenizing data. C error: EOF inside string starting at row 69909
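
The `ParserError` above ("EOF inside string") usually points to an unbalanced quote character somewhere in the file. A minimal sketch of one workaround, assuming that is the cause here, is to tell the parser to treat quote characters as ordinary text. The helper name `read_rows_ignoring_quotes` and the toy rows are hypothetical and only illustrate the idea:

```python
import csv
import io

def read_rows_ignoring_quotes(handle):
    """Parse comma-delimited rows, treating quote characters as plain text.

    Hypothetical helper: csv.QUOTE_NONE disables special handling of quotes,
    so a stray opening quote cannot swallow the rest of the file.
    """
    return [row for row in csv.reader(handle, delimiter=",", quoting=csv.QUOTE_NONE)]

# Toy stand-in for a file with one unbalanced quote (not the real CMU data)
sample = io.StringIO('cat,K AE T\n"tis,T IH Z\ndog,D AO G\n')
rows = read_rows_ignoring_quotes(sample)
```

The same constant can be passed to `pd.read_csv` through its `quoting` parameter. Whether dropping quote handling is appropriate depends on whether any field in `data/cmu_words.csv` legitimately contains the delimiter.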

My Sidewalks has this many total words:

In [22]:
N = len(my_sidewalks)
print(N)

10654


And this many unique words:

In [4]:
print(len(set(my_sidewalks)))

1351


The goal of constructing the background vocabulary is to identify additional sets of (total) words (i.e., tokens, not types) that comprise a certain proportion of the overall training environment. We will implement this such that we have conditions where the program words represent 25%, 50%, 75%, and 100% of the overall training environment.
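
To make the token/type distinction concrete, here is a small sketch on a toy list taken from the first few My Sidewalks words. The variable names are illustrative only:

```python
from collections import Counter

# First few words of the My Sidewalks list, used as a toy example
toy_words = ["diz", "i", "am", "diz", "i", "am", "nat"]

n_tokens = len(toy_words)        # total words (tokens): every occurrence counts
n_types = len(set(toy_words))    # unique words (types): each word counts once
type_counts = Counter(toy_words) # how many tokens each type contributes
```

The proportions in the training conditions are defined over tokens, so a frequent word such as "diz" contributes once per occurrence.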

In [5]:
program_details = []

for i, row in programs.iterrows():
    if "Sidewalks" in row['program_name']:
        program_details.append(row.program_name)

In [6]:
set(program_details)

{'My Sidewalks K',
 'My Sidewalks Level A Unit 1',
 'My Sidewalks Level A Unit 2',
 'My Sidewalks Level A Unit 3',
 'My Sidewalks Level A Unit 4',
 'My Sidewalks Level A Unit 5'}

We will set the overall vocabulary size to approximately 10K total words (the exact value is the number of words read when going through all the My Sidewalks books once: 10,654 total words). This is a realistic and tractable number of words to use as training examples. In the condition where the decodable texts make up 100% of the training set, the My Sidewalks books are all read through once. In the condition where the decodable texts make up 75% of the overall vocabulary, the remaining 25% of the words are sampled from children's sources. Likewise for the 50% and 25% conditions.
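
The mixing scheme described above can be sketched as a single function. This is an illustration under assumptions, not the notebook's own code: `split_sizes` and `build_condition` are hypothetical names, the program words are taken from the start of the list, and the background words are drawn with replacement.

```python
import random

def split_sizes(n_total, proportion):
    """Return (program token count, background token count) for a condition."""
    treatment_n = round(proportion * n_total)
    return treatment_n, n_total - treatment_n

def build_condition(program_words, background_words, proportion, rng):
    """Mix the first `proportion` of program tokens with sampled background tokens."""
    treatment_n, background_n = split_sizes(len(program_words), proportion)
    condition = list(program_words[:treatment_n])
    # Background tokens are sampled with replacement from the background source
    condition.extend(rng.choice(background_words) for _ in range(background_n))
    return condition

# Toy data only; a local Random instance keeps the example reproducible
rng = random.Random(765)
toy_program = ["diz", "i", "am", "nat", "can", "hop", "not", "hit"]
toy_background = ["house", "tree", "water"]
toy_75 = build_condition(toy_program, toy_background, 0.75, rng)
```

One detail worth knowing: Python's `round` rounds exact halves to the nearest even integer, so with N = 10,654 the 75% condition gets 7,990 program words (from 7,990.5) and the 25% condition gets 2,664 (from 2,663.5).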

My Sidewalks 100% condition...

In [7]:
my_sidewalks

['diz',
 'i',
 'am',
 'diz',
 'i',
 'am',
 'nat',
 'nat',
 'can',
 'hop',
 'diz',
 'can',
 'not',
 'hop',
 'diz',
 'can',
 'hop',
 'can',
 'diz',
 'hit',
 'it',
 'i',
 'am',
 'diz',
 'can',
 'i',
 'hit',
 'it',
 'yes',
 'i',
 'can',
 'hit',
 'it',
 'i',
 'can',
 'run',
 'i',
 'can',
 'run',
 'in',
 'the',
 'sun',
 'yes',
 'i',
 'can',
 'win',
 'it',
 'was',
 'hot',
 'the',
 'sun',
 'was',
 'hot',
 'diz',
 'was',
 'hot',
 'diz',
 'got',
 'in',
 'he',
 'got',
 'wet',
 'a',
 'bug',
 'got',
 'in',
 'the',
 'bug',
 'got',
 'wet',
 'it',
 'was',
 'fun',
 'dan',
 'and',
 'the',
 'van',
 'dan',
 'can',
 'hop',
 'in',
 'the',
 'big',
 'red',
 'van',
 'was',
 'tim',
 'in',
 'the',
 'red',
 'van',
 'yes',
 'he',
 'was',
 'in',
 'the',
 'big',
 'red',
 'van',
 'bud',
 'the',
 'pup',
 'diz',
 'had',
 'a',
 'pet',
 'pup',
 'the',
 'pup',
 'was',
 'bud',
 'he',
 'fed',
 'him',
 'diz',
 'hid',
 'in',
 'a',
 'box',
 'with',
 'bud',
 'the',
 'pup',
 'diz',
 'said',
 'run',
 'bud',
 'he',
 'ran',
 'in',


### Control condition
The control condition will be based on the LLI books, which roughly represent "trade books". From the LLI books we will draw enough text to make up 10,654 words (i.e., the number of total words in My Sidewalks). Note that the control condition represents a set of words where the My Sidewalks proportion is 0% of the words. That is, the background vocabulary is selected from these control words to make up 25%, 50%, or 75% of the words.

In [31]:
control = []

for i in range(N):
    sampled = random.choice(acbc)
    control.append(sampled)

In [26]:
len(control)

10654

Condition: 75% of words are from My Sidewalks 

In [56]:
proportion = .75
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_75 = my_sidewalks[:treatment_n]
condition_75.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_75) == N

Condition: 50% of words are from My Sidewalks

In [57]:
proportion = .50
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_50 = my_sidewalks[:treatment_n]
condition_50.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_50) == N

Condition: 25% of words are from My Sidewalks

In [58]:
proportion = .25
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_25 = my_sidewalks[:treatment_n]
condition_25.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_25) == N

Condition: Trade Books
Just for exploration, let's also include Trade Books, using LLI for that purpose. This can be another training environment. We will length match with My Sidewalks.

In [59]:
trade_books_sample = []
for i in range(N):
    trade_books_sample.append(random.choice(trade_books))

assert len(trade_books_sample) == N

## Write datasets
Let's write these to `data/`

In [64]:
with open('data/my_sidewalks_100_percent.csv', "w") as f:
    for word in my_sidewalks:
        f.write("{}\n".format(word))
        
with open('data/my_sidewalks_75_percent_background_25_percent.csv', "w") as f:
    for word in condition_75:
        f.write("{}\n".format(word))
        
with open('data/my_sidewalks_50_percent_background_50_percent.csv', "w") as f:
    for word in condition_50:
        f.write("{}\n".format(word))
        
with open('data/my_sidewalks_25_percent_background_75_percent.csv', "w") as f:
    for word in condition_25:
        f.write("{}\n".format(word))
        
        
with open('data/my_sidewalks_0_percent_background_100_percent.csv', "w") as f:
    for word in control:
        f.write("{}\n".format(word))
        
# Write the length-matched sample (N words), not the full LLI word list
with open('data/trade_books_100_percent.csv', "w") as f:
    for word in trade_books_sample:
        f.write("{}\n".format(word))
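
The write loops above all follow one pattern, so they could be factored into a small helper. A minimal sketch, where `write_word_list` is a hypothetical name and the demo writes to a temporary directory rather than `data/`:

```python
import os
import tempfile

def write_word_list(path, words):
    """Write one word per line to `path` (hypothetical helper)."""
    with open(path, "w") as f:
        for word in words:
            f.write("{}\n".format(word))

# Demo on a temporary directory so nothing under data/ is touched
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_words.csv")
write_word_list(demo_path, ["diz", "i", "am"])

with open(demo_path) as f:
    written = f.read().splitlines()
```

In the notebook this would reduce each write to a single call, for example `write_word_list('data/my_sidewalks_100_percent.csv', my_sidewalks)`.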

### Test sets
For the test sets, sets of 10,654 words will be selected from children's and adult language sources. For each source, both a random sample and a frequency-weighted sample will be drawn.
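
A rough sketch of the two sampling schemes, under one possible reading of the description: the "random" sample draws uniformly over unique words (types), and the "frequency-weighted" sample draws types in proportion to their token counts. Both the interpretation and the name `sample_test_sets` are assumptions, and both draws are made with replacement.

```python
import random
from collections import Counter

def sample_test_sets(tokens, n, seed=765):
    """Return (uniform_sample, weighted_sample), each with n words.

    Assumption: 'random' means uniform over types, and 'frequency weighted'
    means proportional to each type's token count in `tokens`.
    """
    rng = random.Random(seed)
    counts = Counter(tokens)
    types = sorted(counts)  # fixed order keeps results reproducible
    uniform_sample = rng.choices(types, k=n)
    weighted_sample = rng.choices(types, weights=[counts[w] for w in types], k=n)
    return uniform_sample, weighted_sample

# Toy source only; the real sources would be the children's and adult corpora
toy_source = ["the", "the", "the", "the", "dog", "ran", "the", "dog"]
uniform_sample, weighted_sample = sample_test_sets(toy_source, n=20)
```

In the actual script, `n` would be 10,654 to match the training set size.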