# Overview
This script generates the background vocabulary for our pilot simulations, which will be shared at SSSR in Copenhagen in July 2024. To do this, we first read in the My Sidewalks data, count the total number of words present there, and then sample from other relevant language sources to generate the background vocabularies for training. This script also generates the test sets for this training condition. See each section below for more information.

In [23]:
import pandas as pd
import random

import nltk
cmuduct = nltk.corpus.cmudict.dict()
random.seed(765)

In [24]:
# Store the CMUdict entries in a set so the `word in cmu` checks below are fast.
cmu = {word.lower() for word in cmuduct.keys() if word.isalpha()}
acbc_full = pd.read_csv('data/vocabulary/tidycorpus.csv')




In [25]:
# Use `and` (not `&`) so non-string tokens such as NaN are skipped before .isalpha() is called.
acbc = [word.lower() for word in acbc_full.token.tolist() if isinstance(word, str) and word.isalpha()]
acbc = [word for word in acbc if word in cmu]
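
The filtering steps above (keep alphabetic strings, lowercase them, keep only words found in the pronunciation dictionary) can be sketched on toy data as follows. This is a minimal illustration, not the notebook's code: `clean_tokens`, `toy_lexicon`, and the `raw` list are hypothetical names and values, and a small set stands in for CMUdict.

```python
def clean_tokens(raw_tokens, lexicon):
    """Keep alphabetic string tokens, lowercased, that appear in the lexicon."""
    lexicon = set(lexicon)  # set membership is a constant-time lookup
    cleaned = []
    for token in raw_tokens:
        # `and` short-circuits, so .isalpha() is never called on a non-string
        if isinstance(token, str) and token.isalpha():
            word = token.lower()
            if word in lexicon:
                cleaned.append(word)
    return cleaned


# Hypothetical stand-ins for CMUdict and a raw token column.
toy_lexicon = {"cat", "dog", "sun"}
raw = ["Cat", "dog", float("nan"), "sun!", "xyzzy", "SUN"]

cleaned = clean_tokens(raw, toy_lexicon)
print(cleaned)  # ['cat', 'dog', 'sun']
```

The missing value, the token with punctuation, and the out-of-lexicon word are all dropped, while repeated words are kept because the lists are token lists.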

In [27]:
programs = pd.read_csv('data/combined_programs.csv')
my_sidewalks = []
trade_books = []

for i, row in programs.iterrows():
    if "Sidewalks" in row['program_name']:
        if isinstance(row.word_raw, str):
            my_sidewalks.append(row.word_raw.lower())
    if "LLI" in row["program_name"]:
        if isinstance(row.word_raw, str):
            trade_books.append(row.word_raw.lower())
            
            
my_sidewalks = [word for word in my_sidewalks if word in cmu]

In [28]:
# Use `and` (not `&`) so the string check short-circuits before .isalpha() is called.
trade_books = [word.lower() for word in trade_books if isinstance(word, str) and word.isalpha()]
trade_books = [word for word in trade_books if word in cmu]

My Sidewalks has this many total words:

In [29]:
N = len(my_sidewalks)
print(N)

10476


And this many unique words:

In [30]:
print(len(set(my_sidewalks)))

1282


The goal of constructing the background vocabulary is to identify additional sets of (total) words (i.e., tokens, not types) that comprise a certain proportion of the overall training environment. We will implement this such that we have conditions where the program words represent 25%, 50%, 75%, and 100% of the overall training environment.
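
To make the token/type distinction concrete, here is a small sketch on a toy word list. The list is made up for illustration and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical toy text, in reading order.
tokens = ["i", "am", "i", "am", "nat", "nat", "can", "hop"]

n_tokens = len(tokens)          # total words (tokens)
n_types = len(set(tokens))      # unique words (types)
type_counts = Counter(tokens)   # number of occurrences of each type

print(n_tokens, n_types)        # 8 5
print(type_counts["i"])         # 2
```

The proportions in each condition are defined over tokens, so a word that occurs many times in the program texts counts each time it occurs.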

In [31]:
program_details = []

for i, row in programs.iterrows():
    if "Sidewalks" in row['program_name']:
        program_details.append(row.program_name)

In [32]:
set(program_details)

{'My Sidewalks K',
 'My Sidewalks Level A Unit 1',
 'My Sidewalks Level A Unit 2',
 'My Sidewalks Level A Unit 3',
 'My Sidewalks Level A Unit 4',
 'My Sidewalks Level A Unit 5'}

We will set the overall vocabulary size to be approximately 10K total words (the exact value is set by reading all the My Sidewalks books one time; 10,476 total words after filtering, as printed above). This is a realistic and tractable number of words to use as training examples. For the condition where the decodable texts are 100% of the training set, the My Sidewalks books will all be read once through. In the case where the decodable texts represent 75% of the overall vocabulary, 25% of the words will be sampled from children's sources. Likewise for the 50% and 25% conditions.
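
As a worked example of the arithmetic, the sketch below computes how many program and background tokens each condition contains, using the same `round(proportion * N)` rule as the condition cells and the value of `N` printed by the notebook (10476). The `splits` dictionary is a hypothetical name used only for this illustration.

```python
N = 10476  # total My Sidewalks tokens printed by the notebook

splits = {}
for percent in (100, 75, 50, 25, 0):
    treatment_n = round(percent / 100 * N)   # tokens taken from the program
    background_n = N - treatment_n           # tokens sampled from the background
    splits[percent] = (treatment_n, background_n)
    print(f"{percent:>3}% program: {treatment_n:>5} program + {background_n:>5} background")
```

For example, the 75% condition has 7857 program tokens and 2619 background tokens, and every condition sums to 10476.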

My Sidewalks 100% condition...

In [33]:
my_sidewalks

['i',
 'am',
 'i',
 'am',
 'nat',
 'nat',
 'can',
 'hop',
 'can',
 'not',
 'hop',
 'can',
 'hop',
 'can',
 'hit',
 'it',
 'i',
 'am',
 'can',
 'i',
 'hit',
 'it',
 'yes',
 'i',
 'can',
 'hit',
 'it',
 'i',
 'can',
 'run',
 'i',
 'can',
 'run',
 'in',
 'the',
 'sun',
 'yes',
 'i',
 'can',
 'win',
 'it',
 'was',
 'hot',
 'the',
 'sun',
 'was',
 'hot',
 'was',
 'hot',
 'got',
 'in',
 'he',
 'got',
 'wet',
 'a',
 'bug',
 'got',
 'in',
 'the',
 'bug',
 'got',
 'wet',
 'it',
 'was',
 'fun',
 'dan',
 'and',
 'the',
 'van',
 'dan',
 'can',
 'hop',
 'in',
 'the',
 'big',
 'red',
 'van',
 'was',
 'tim',
 'in',
 'the',
 'red',
 'van',
 'yes',
 'he',
 'was',
 'in',
 'the',
 'big',
 'red',
 'van',
 'bud',
 'the',
 'pup',
 'had',
 'a',
 'pet',
 'pup',
 'the',
 'pup',
 'was',
 'bud',
 'he',
 'fed',
 'him',
 'hid',
 'in',
 'a',
 'box',
 'with',
 'bud',
 'the',
 'pup',
 'said',
 'run',
 'bud',
 'he',
 'ran',
 'in',
 'the',
 'sun',
 'with',
 'bud',
 'the',
 'pup',
 'mom',
 'said',
 'bud',
 'he',
 'was',

### Control condition
The control condition contains no My Sidewalks words (the My Sidewalks proportion is 0%). It is length-matched to My Sidewalks at 10,476 words (i.e., the number of total words in My Sidewalks). Although the LLI books represent "trade books", roughly, the cell below draws the control words from `acbc` (the corpus read from `tidycorpus.csv`), not from LLI; the LLI-based conditions are built under "Trade book conditions". The same `acbc` words supply the background vocabulary that makes up 25%, 50%, and 75% of the mixed conditions.

In [34]:
control = []

for i in range(N):
    sampled = random.choice(acbc)
    control.append(sampled)

In [35]:
len(control)

10476

Condition: 75% of words are from My Sidewalks 

In [36]:
proportion = .75
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_75 = my_sidewalks[:treatment_n]
condition_75.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_75) == N

Condition: 50% of words are from My Sidewalks

In [37]:
proportion = .50
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_50 = my_sidewalks[:treatment_n]
condition_50.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_50) == N

Condition: 25% of words are from My Sidewalks

In [38]:
proportion = .25
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_25 = my_sidewalks[:treatment_n]
condition_25.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_25) == N

Condition: Trade Books
Just for exploration, let's also include Trade Books, using LLI for that purpose. This can be another training environment. We will length match with My Sidewalks. This is replaced below by more systematic sampling of trade books.

In [39]:
trade_books_sample = []
for i in range(N):
    trade_books_sample.append(random.choice(trade_books))

assert len(trade_books_sample) == N

## Trade book conditions

Condition: 100% trade books, 0% background

In [40]:
proportion = 1
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_100_trade_books = trade_books[:treatment_n]
condition_100_trade_books.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_100_trade_books) == N

Condition: 75% trade books, 25% background


In [41]:
proportion = .75
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_75_trade_books = trade_books[:treatment_n]
condition_75_trade_books.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_75_trade_books) == N

Condition: 50% trade books, 50% background

In [42]:
proportion = .50
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_50_trade_books = trade_books[:treatment_n]
condition_50_trade_books.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_50_trade_books) == N

Condition: 25% trade books, 75% background

In [43]:
proportion = .25
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_25_trade_books = trade_books[:treatment_n]
condition_25_trade_books.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_25_trade_books) == N

Condition: 0% trade books, 100% background

In [44]:
proportion = 0
treatment_n = round(proportion*N)
background_n = N-treatment_n
assert treatment_n + background_n == N
condition_0_trade_books = trade_books[:treatment_n]
condition_0_trade_books.extend([random.choice(acbc) for i in range(background_n)])
assert len(condition_0_trade_books) == N
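
The condition cells above all repeat the same steps with a different proportion and source list. One way to factor them out is sketched below; `build_condition` is a hypothetical helper, not part of the notebook. It also assumes a local `random.Random` instance is acceptable, which would produce different draws than the notebook's global `random.seed(765)`.

```python
import random


def build_condition(treatment, background, proportion, n, rng):
    """Take the first round(proportion * n) treatment tokens, then fill up
    to n tokens by sampling the background list with replacement."""
    treatment_n = round(proportion * n)
    if treatment_n > len(treatment):
        raise ValueError("not enough treatment tokens for this proportion")
    background_n = n - treatment_n
    condition = list(treatment[:treatment_n])  # copy, so the input list is untouched
    condition.extend(rng.choice(background) for _ in range(background_n))
    assert len(condition) == n
    return condition


# Toy data standing in for my_sidewalks / trade_books and acbc.
rng = random.Random(765)
toy_treatment = ["nat", "can", "hop", "in", "the", "sun", "yes", "he"]
toy_background = ["dog", "tree", "house"]

demo = build_condition(toy_treatment, toy_background, 0.75, 8, rng)
print(demo[:6])   # the first six treatment tokens, in reading order
print(demo[6:])   # two tokens sampled from the background
```

Because the treatment tokens are taken as a prefix, the lower-proportion conditions contain only the earliest program texts, which matches the slicing used in the cells above.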

In [46]:
acbc_sample = []

acbc_sample.extend([random.choice(acbc) for i in range(N)])

## Write datasets
Let's write these to `data/`

In [48]:
with open('data/my_sidewalks_100_percent.csv', "w") as f:
    for word in my_sidewalks:
        f.write("{}\n".format(word))
        
with open('data/my_sidewalks_75_percent_background_25_percent.csv', "w") as f:
    for word in condition_75:
        f.write("{}\n".format(word))
        
with open('data/my_sidewalks_50_percent_background_50_percent.csv', "w") as f:
    for word in condition_50:
        f.write("{}\n".format(word))
        
with open('data/my_sidewalks_25_percent_background_75_percent.csv', "w") as f:
    for word in condition_25:
        f.write("{}\n".format(word))
        
with open('data/my_sidewalks_0_percent_background_100_percent.csv', "w") as f:
    for word in control:
        f.write("{}\n".format(word))
        
with open('data/kidwords_weighted_sample.csv', "w") as f:
    for word in acbc_sample:
        f.write("{}\n".format(word))

with open('data/trade_books_100_percent.csv', "w") as f:
    for word in condition_100_trade_books:
        f.write("{}\n".format(word))
        
with open('data/trade_books_75_percent_background_25_percent.csv', "w") as f:
    for word in condition_75_trade_books:
        f.write("{}\n".format(word))

with open('data/trade_books_50_percent_background_50_percent.csv', "w") as f:
    for word in condition_50_trade_books:
        f.write("{}\n".format(word))

with open('data/trade_books_25_percent_background_75_percent.csv', "w") as f:
    for word in condition_25_trade_books:
        f.write("{}\n".format(word))
        
with open('data/trade_books_0_percent_background_100_percent.csv', "w") as f:
    for word in condition_0_trade_books:
        f.write("{}\n".format(word))

with open('data/trade_books_weighted_sample.csv', "w") as f:
    for word in trade_books_sample:
        f.write("{}\n".format(word))
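
The repeated `with open(...)` blocks above could be driven from a single mapping of file name to word list, which makes a mismatched file name easier to spot. The sketch below is one possible arrangement: `write_word_list`, `read_word_list`, and the demo file names are hypothetical, and it writes to a temporary directory so nothing under `data/` is touched.

```python
import tempfile
from pathlib import Path


def write_word_list(path, words):
    """Write one word per line, matching the format used by the cells above."""
    with open(path, "w") as f:
        for word in words:
            f.write(f"{word}\n")


def read_word_list(path):
    """Read a one-word-per-line file back into a list."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical mapping from output file name to condition word list.
demo_outputs = {
    "demo_100_percent.csv": ["i", "am", "nat"],
    "demo_0_percent_background_100_percent.csv": ["dog", "tree", "dog"],
}

with tempfile.TemporaryDirectory() as tmp:
    for filename, words in demo_outputs.items():
        write_word_list(Path(tmp) / filename, words)
    # Read everything back to confirm the files round-trip.
    round_trip = {name: read_word_list(Path(tmp) / name) for name in demo_outputs}

print(round_trip == demo_outputs)  # True
```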

### Test sets
Each test set will contain 10,476 words selected from children's and adult language sources. For each source, both a random sample and a frequency-weighted sample will be drawn.
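
The notebook does not show the test-set sampling itself, so the following is only a sketch under stated assumptions: "random sample" is read as sampling uniformly over types (each unique word equally likely), and "frequency-weighted sample" as sampling types in proportion to their token counts. Both function names and the toy corpus are hypothetical.

```python
import random
from collections import Counter


def random_sample(tokens, n, rng):
    """Assumed reading of 'random sample': uniform over types, with replacement."""
    types = sorted(set(tokens))
    return [rng.choice(types) for _ in range(n)]


def frequency_weighted_sample(tokens, n, rng):
    """Sample types with probability proportional to their token frequency."""
    counts = Counter(tokens)
    types = sorted(counts)
    weights = [counts[t] for t in types]
    return rng.choices(types, weights=weights, k=n)


# Toy corpus: one very frequent word and one rare word.
toy_tokens = ["the"] * 95 + ["cat"] * 5
rng = random.Random(765)

uniform = random_sample(toy_tokens, 2000, rng)
weighted = frequency_weighted_sample(toy_tokens, 2000, rng)

print(Counter(uniform))
print(Counter(weighted))
```

In the uniform sample the two words appear about equally often, while in the weighted sample "the" dominates. Calling `random.choice` directly on a token list, as the condition cells do with `acbc`, has the same effect as the frequency-weighted version.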