<a href="https://colab.research.google.com/github/Cyanjiner/classroom-analysis/blob/main/math_corpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Colab workspace
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
ROOT_FOLDER = "/content/drive/My Drive/transcript_analysis/"
DATA_FOLDER = ROOT_FOLDER + "data/"

RAW_FOLDER = DATA_FOLDER + "raw/"
CLEAN_FOLDER = DATA_FOLDER + "clean/"
MODELS_FOLDER = ROOT_FOLDER + "models/"
LOG_DIR = MODELS_FOLDER + "runs/"

In [4]:
import pandas as pd
from numpy import nan as NaN  # the `NaN` alias itself was removed in NumPy 2.0
import re

# Load all math corpora

- ✅ `ssdf` (`ccss_text.csv`) -- Common Core State Standards

- ✅ `imdf` (`IM_text.csv`) | `math_lang/imdf_sentence_level.csv` -- [Illustrative Mathematics](https://illustrativemathematics.blog/2020/03/26/im-talking-math/)

- ✅ `tdf` (`talk_math.csv`) & `math_lang/talk_math_all.csv | talk_math_sentence_level.csv`

- `sfdf` (`SFUSD/sfusd_mathtalks.csv`) -- San Francisco Unified School District (SFUSD) Elementary (grades 3-5) Math Talks Bank [original resource link](?)

- ✅ `edf` (`engageNY/full_lesson_text.csv`)

- 🤔 (pending for usage) `gdf` (`math_glossary.csv`) -- [CCSS Mathematics Glossary](https://web.archive.org/web/20220206180509/http://www.corestandards.org/Math/Content/mathematics-glossary/)

In [4]:
ssdf = pd.read_csv(DATA_FOLDER + "ccss_text.csv") 
imdf = pd.read_csv(DATA_FOLDER + "IM_text.csv")
tdf = pd.read_csv(DATA_FOLDER + "talk_math.csv")
tdf2 = pd.read_csv(DATA_FOLDER + "tdf2.csv")
gdf = pd.read_csv(DATA_FOLDER + "math_glossary.csv")
sfdf = pd.read_csv(DATA_FOLDER + "SFUSD/sfusd_mathtalks.csv")
edf = pd.read_csv(DATA_FOLDER + "engageNY/full_lesson_text.csv")

## Talking Math

In [763]:
tdf2.head()

Unnamed: 0,grade,lesson,standard,text
0,K,Invitational 1,,"[What kinds of toys do you see?, What can you ..."
1,1,Invitational 1,(K.MD and 1.MD),[How many toys are round? How many have straig...
2,2,Invitational 1,(1.MD and 2.MD.10),"[Complete the sentences:, There are 2 more ___..."
3,3,Invitational 1,(2.MD),"[Pick four kinds of toys and count them. , If ..."
4,4,Invitational 1,3.MD,[These are the lengths of the cars in the pict...


### Clean up Talking Math

In [152]:
import re

In [157]:
with open(DATA_FOLDER + "talking_math.txt", "r") as file:
  tm_text = file.read()

In [167]:
def extract_paren_substring(string):
    start = string.index("(")
    end = string.index(")")
    substring = string[start+1:end]
    return string.replace('('+ substring + ')', ""), substring

In [68]:
extract_paren_substring('(5.NF.B.7) How much juice could all of the oranges in the picture produce?')

(' How much juice could all of the oranges in the picture produce?',
 '5.NF.B.7')
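
`extract_paren_substring` raises a `ValueError` when the input has no `(` or `)`, because `str.index` fails on a missing character. The sketch below is a hypothetical variant (`extract_paren_safe` is not part of the notebook) that uses a regular expression and returns the input unchanged when no well-formed pair exists. Unlike the original, it removes only the first parenthetical rather than every repeat of it.

```python
import re

def extract_paren_safe(string):
    """Return (text without the first parenthetical, its content or None)."""
    match = re.search(r"\(([^()]*)\)", string)
    if match is None:
        # No well-formed "(...)" pair: hand the input back untouched.
        return string, None
    cleaned = string[:match.start()] + string[match.end():]
    return cleaned, match.group(1)

with_standard = extract_paren_safe("(5.NF.B.7) How much juice could all of the oranges produce?")
without_standard = extract_paren_safe("What kinds of toys do you see?")
```

Returning `None` for a missing standard keeps the "no standard" case distinct from an empty pair of parentheses.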

In [168]:
def split_grade_section(lesson_text):
  grade_content = re.split(r"(Grade K+|Grade \d+)", lesson_text)[1:]
  grade_dict = {}
  for j in range(len(grade_content)):
    g_content = grade_content[j]
    if j % 2 == 0: # even index --> grade_content[j] is the separator
      index = grade_content[j]
      #g_text = grade_content[j+1].strip().split('\n')
      g_text = re.sub(r'\s+', ' ', grade_content[j+1])
      g_text = re.sub('Talking Math','', g_text)
      text_list = re.split(r'(?<=[.!?]\s)+',g_text) # split by punctuation characters
      # filtering out empty string & any element w/ "Talking Math" from the g_text list
      text_list = list(filter(None, text_list))
      #get standard if there is one
      for i in range(len(text_list)):
         if '(' in text_list[i] and ')' in text_list[i]: # `'(' and ')' in s` only tested for ')'
           new_s, standard = extract_paren_substring(text_list[i])
           text_list[i] = new_s
         else:
           standard = 'NA'
         result = {'standard': standard, "text": [text for text in text_list if text.strip()]}

      grade_dict[index] = result
      #grade_dict[index] = grade_content[j+1]
  return grade_dict
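
The function relies on a property of `re.split`: when the pattern contains a capturing group, the matched separators are kept in the result, so headers and bodies alternate. A minimal sketch on a made-up string (the sample text is an assumption, not taken from `talking_math.txt`):

```python
import re

sample = "intro Grade K Count the toys. Grade 1 How many are round? Grade 2 Compare them."
parts = re.split(r"(Grade K+|Grade \d+)", sample)

# parts[0] is whatever precedes the first header, so drop it.
pieces = parts[1:]

# Even positions now hold headers, odd positions hold the text that follows.
sections = {pieces[i]: pieces[i + 1].strip() for i in range(0, len(pieces), 2)}
```

Stepping through the list two items at a time avoids the parity check inside the loop.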

In [169]:
main_section = re.split(r"(Invitational \d+|Day \d+)", tm_text)
lesson = []
text = []
for i, content in enumerate(main_section[1:]):
  if i % 2 != 0: 
    lesson_text = content # odd index --> content is the text that follows the lesson header
    grade_dict = split_grade_section(lesson_text)
    text.append(grade_dict)
  else: # even index --> content is the separator (i.e. lesson index)
    lesson_name = content
    lesson.append(lesson_name)
d = {'lesson': lesson, "text": text}

In [185]:
d['text'][1]

{'Grade 4': {'standard': '3.MD',
  'text': [' These are the lengths of the cars in the picture, organized by color: Turquoise 1 ¾ inch Yellow 1 ¼ inch Red 1 ½ inch Green 1 ¾ inch Orange 1 ¾ inch Describe how you would make a line plot about the length of the cars. ']},
 'Grade 5': {'standard': '4.MD.4',
  'text': [' These are the lengths of the blocks in the picture, organized by color: Turquoise 1 ¼ inch Yellow 1 ½ inch Red 1 ½ inch Green 1 ⅜ inch Orange 1 ½ inch Describe how you would make a line plot of this data about the length of the cars. ']}}

In [183]:
d['text'][1]['Grade 4']['text'][0][83]

'¾'
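
Characters such as `¾` are single Unicode code points, which matters for the later step of replacing math symbols with oral expressions. The sketch below uses the standard `unicodedata` module to read off a fraction's value and spell it out from its official Unicode name. `spell_vulgar_fractions` is a hypothetical helper, not something defined in this notebook.

```python
import unicodedata

def spell_vulgar_fractions(text):
    """Replace characters named 'VULGAR FRACTION ...' with their spelled-out name."""
    prefix = "VULGAR FRACTION "
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        if name.startswith(prefix):
            out.append(name[len(prefix):].lower())
        else:
            out.append(ch)
    return "".join(out)

value = unicodedata.numeric("¾")
spoken = spell_vulgar_fractions("Red 1 ½ inch")
```

The result still reads "1 one half inch", so a mixed number would need a further rule (for example inserting "and") before it sounds natural.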

In [186]:
n = len(d['lesson'])
talk_math_dict = []
for i in range(n):
  lesson_index = d['lesson'][i]
  grade_dict = d['text'][i]
  for grade, values in grade_dict.items():
    grade_name = grade.split(" ")[1]
    standard = values['standard']
    text = values['text']
    talk_math_dict.append({'grade': grade_name, 'lesson':lesson_index, 'standard': standard, 'text': text})

In [189]:
tdf2 = pd.DataFrame(talk_math_dict)

In [190]:
tdf2.head()

Unnamed: 0,grade,lesson,standard,text
0,K,Invitational 1,,"[ What kinds of toys do you see? , What can yo..."
1,1,Invitational 1,K.MD and 1.MD,"[ How many toys are round? , How many have str..."
2,2,Invitational 1,1.MD and 2.MD.10,[ Complete the sentences: There are 2 more ___...
3,3,Invitational 1,2.MD,"[ Pick four kinds of toys and count them. , If..."
4,4,Invitational 1,3.MD,[ These are the lengths of the cars in the pic...


In [200]:
#from google.colab import files
#tdf2.to_csv("tdf2.csv")
#files.download('tdf2.csv')
tdf2.to_csv(DATA_FOLDER + "math_lang/talk_math_all.csv")

### Talking Math -- Sentence Level

In [203]:
tdf2 # view the full DataFrame (no grade filter is applied here)

Unnamed: 0,grade,lesson,standard,text
0,K,Invitational 1,,"[ What kinds of toys do you see? , What can yo..."
1,1,Invitational 1,K.MD and 1.MD,"[ How many toys are round? , How many have str..."
2,2,Invitational 1,1.MD and 2.MD.10,[ Complete the sentences: There are 2 more ___...
3,3,Invitational 1,2.MD,"[ Pick four kinds of toys and count them. , If..."
4,4,Invitational 1,3.MD,[ These are the lengths of the cars in the pic...
...,...,...,...,...
644,1,Day 95,1.MD.A.2,"[ Find some of these objects in your home. , H..."
645,2,Day 95,2.MD.A.1,"[ There are rulers in this picture. , What can..."
646,3,Day 95,3.MD.B.4,[ Find 3 of the objects similar to those in th...
647,4,Day 95,4.MD.A.1,[ The longer paint brushes in the picture are ...


In [11]:
import pandas as pd

In [196]:
type(tdf2.iloc[0]['text'])

list

In [201]:
# convert text column to list
#tdf2['text'] = tdf2['text'].apply(lambda x: eval(x))
# split the row into multiple rows using explode
tdf2_by_sentence = tdf2.explode('text')
tdf2_by_sentence.to_csv(DATA_FOLDER + "math_lang/talk_math_sentence_level.csv")

In [202]:
tdf2_by_sentence

Unnamed: 0,grade,lesson,standard,text
0,K,Invitational 1,,What kinds of toys do you see?
0,K,Invitational 1,,What can you tell me about the toys?
1,1,Invitational 1,K.MD and 1.MD,How many toys are round?
1,1,Invitational 1,K.MD and 1.MD,How many have straight sides?
1,1,Invitational 1,K.MD and 1.MD,Do you think there are more round toys or toys...
...,...,...,...,...
647,4,Day 95,4.MD.A.1,Can paint brushes be measured in meters?
647,4,Day 95,4.MD.A.1,Explain.
648,5,Day 95,5.MD.A.1,If all the pencils in the picture were lined ...
648,5,Day 95,5.MD.A.1,What is this measure in meters?


## IM text

In [None]:
imdf.head()

### Decompose by sentence level

In [210]:
import ast
imdf['ts_text'] = imdf['ts_text'].apply(ast.literal_eval) # parse the stringified lists without eval

In [225]:
# decompose by sentence level
imdf_sentence = imdf.explode('ts_text')

In [226]:
imdf_sentence.iloc[0]['ts_text']

'“What are some different ways you and your partner can work together to count the collection?” (One person can count first and then the next can count to see if they get the same amount. We can take turns moving an object and counting a number.)'

In [227]:
imdf_sentence.iloc[3]['ts_text']

'“What does it look and sound like to do math together as a mathematical community?” (We talked to each other and to the teacher. We had quiet time to think. You asked us questions. We shared our ideas. We thought about the math ideas and words we knew. You were writing down our answers. You were waiting quietly until we gave the answers.)'

In [244]:
imdf_sentence.iloc[100]['ts_text']

'“Today we answered questions about data represented with tally marks and numbers. Which representation do you prefer? Why do you like that representation better?” (I prefer tally marks because I don’t have to use cubes or make a drawing to add the numbers together. I prefer the numbers because I don’t have to count.)'

In [245]:
s = re.search(r'“(.*)”\s*\((.*)\)',imdf_sentence.iloc[100]['ts_text']) # \s* allows the space between the closing quote and the parenthesis

In [249]:
double_quotes = re.findall(r'“(.*?)”', imdf_sentence.iloc[100]['ts_text'])

In [252]:
double_quotes

['Today we answered questions about data represented with tally marks and numbers. Which representation do you prefer? Why do you like that representation better?']

In [253]:
parentheses = re.findall(r'\((.*?)\)', imdf_sentence.iloc[100]['ts_text'])

In [264]:
parentheses

['I prefer tally marks because I don’t have to use cubes or make a drawing to add the numbers together. I prefer the numbers because I don’t have to count.']

In [271]:
imdf_sentence['teacher_text']

0        
0        
0        
0        
1        
       ..
2308     
2309     
2309     
2310     
2310     
Name: teacher_text, Length: 12853, dtype: object

### extract teacher & student text respectively into new vars

In [269]:
## extract teacher & student text respectively into new vars
from numpy import nan as NaN  # the `NaN` alias itself was removed in NumPy 2.0
teacher_text = []
student_text = []
for i in range(len(imdf_sentence)):
  ts_text = imdf_sentence.iloc[i]['ts_text']
  if type(ts_text) is str:
    t_text = re.findall(r'“(.*?)”', ts_text)
    s_text = re.findall(r'\((.*?)\)', ts_text)
    if len(t_text) != 0:
      teacher_text.append(t_text[0])
    else:
      teacher_text.append(NaN)
    
    if len(s_text) != 0:
      student_text.append(s_text[0])
    else:
      student_text.append(NaN)
  else:
    teacher_text.append(NaN)
    student_text.append(NaN)
ts_text_dict = {'t_text':teacher_text, 's_text': student_text}

In [277]:
len(ts_text_dict['t_text']) == len(imdf_sentence['ts_text'])

True

In [272]:
t_texts_dt = pd.DataFrame(ts_text_dict)
t_texts_dt.head()

Unnamed: 0,t_text,s_text
0,What are some different ways you and your part...,One person can count first and then the next c...
1,How did you keep track of the objects as you c...,We put each object in a pile after we counted ...
2,How did they show their count?,They drew twelve circles. They wrote the numbe...
3,What does it look and sound like to do math to...,We talked to each other and to the teacher. We...
4,What are some different ways you and your part...,One person can count first and then the next c...
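
The same extraction can be written without an explicit loop. The sketch below uses `Series.str.extract`, which returns the first match of the capture group and yields `NaN` for missing values and non-matching strings, mirroring the `t_text[0]` / `NaN` logic above. The three sample strings are made up for illustration.

```python
import pandas as pd

ts = pd.Series([
    "“How did they show their count?” (They drew twelve circles.)",
    "No quotes or parentheses here",
    None,
])

# expand=False returns a Series because the pattern has a single group.
teacher = ts.str.extract(r"“(.*?)”", expand=False)
student = ts.str.extract(r"\((.*?)\)", expand=False)
pairs = pd.DataFrame({"t_text": teacher, "s_text": student})
```

Because the result keeps the index of `ts`, it can be assigned straight back as new columns without a separate concat step.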


### Merge with main data

In [289]:
t_texts_dt.reset_index(inplace=True) # keep the old index as an 'index' column so the 7 column names assigned below fit

In [287]:
imdf_sub = imdf_sentence[['grade','lesson','standard','ts_text']] # get a subset by selecting a few columns

In [291]:
imdf_sub.reset_index(inplace=True, drop=True)

In [293]:
imdf_sentence_level = pd.concat([t_texts_dt, imdf_sub],
          axis=1, # axis = 0 binding rows, = 1 bind by cols
          #how = 'inner'
          ignore_index=True) 

In [295]:
imdf_sentence_level.columns = ['index','teacher_text','student_text','grade','lesson','standard','ts_text']

In [297]:
imdf_sentence_level

Unnamed: 0,index,teacher_text,student_text,grade,lesson,standard,ts_text
0,0,What are some different ways you and your part...,One person can count first and then the next c...,1,Lesson 1,1.OA.C.5,“What are some different ways you and your par...
1,1,How did you keep track of the objects as you c...,We put each object in a pile after we counted ...,1,Lesson 1,1.OA.C.5,“How did you keep track of the objects as you ...
2,2,How did they show their count?,They drew twelve circles. They wrote the numbe...,1,Lesson 1,1.OA.C.5,“How did they show their count?” (They drew tw...
3,3,What does it look and sound like to do math to...,We talked to each other and to the teacher. We...,1,Lesson 1,1.OA.C.5,“What does it look and sound like to do math t...
4,4,What are some different ways you and your part...,One person can count first and then the next c...,1,Lesson 1,1.OA.C.5,“What are some different ways you and your par...
...,...,...,...,...,...,...,...
12848,12848,What were the most important things about your...,I was able to get three different shapes where...,5,Lesson 18,5.MD.C.3,“What were the most important things about you...
12849,12849,What did the writer of this activity have to p...,"the number of sides, the types of angles or co...",5,Lesson 18,5.MD.C.3,“What did the writer of this activity have to ...
12850,12850,What were the most important things about your...,I was able to get three different shapes where...,5,Lesson 18,5.MD.C.3,“What were the most important things about you...
12851,12851,What did the writer of this activity have to p...,"the number of sides, the types of angles or co...",5,Lesson 18,5.MD.C.3,“What did the writer of this activity have to ...
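
Both frames are re-indexed before the merge because `pd.concat(..., axis=1)` aligns rows by index label, not by position. A minimal sketch of what happens when the labels differ (toy frames, assumed for illustration):

```python
import pandas as pd

left = pd.DataFrame({"t_text": ["a", "b", "c"]})                   # index 0, 1, 2
right = pd.DataFrame({"grade": [1, 1, 2]}, index=[10, 11, 12])     # different labels

# Labels do not overlap, so the result is the union of both indexes, padded with NaN.
misaligned = pd.concat([left, right], axis=1)

# After resetting the index the rows line up by position.
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)
```

With the duplicated labels that `explode` produces, the alignment is ambiguous, which is one more reason to reset the index first.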


In [298]:
imdf_sentence_level.to_csv(DATA_FOLDER + 'math_lang/imdf_sentence_level.csv')

## Engage NY

In [None]:
edf

In [310]:
edf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5873 entries, 0 to 5872
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   5873 non-null   int64 
 1   grade        5873 non-null   object
 2   module       5873 non-null   object
 3   topic        5873 non-null   object
 4   lesson       5873 non-null   int64 
 5   objective    5836 non-null   object
 6   lesson_part  5873 non-null   object
 7   standard     5873 non-null   object
 8   text_type    5873 non-null   object
 9   text         5873 non-null   object
dtypes: int64(2), object(8)
memory usage: 459.0+ KB


In [312]:
edf['grade'].unique()

array(['g4', 'g5', 'g3', 'g6'], dtype=object)

In [314]:
edf['grade'] = edf['grade'].map(lambda x: x.lstrip('g')) # remove all g in grade
# check levels again
edf['grade'].unique()             

array(['4', '5', '3', '6'], dtype=object)

In [329]:
# get a dialogue subset
edf_talk = edf[edf['text_type']=='ts_dialogue'][['grade','standard','lesson_part','text']] 
# reset index
edf_talk.reset_index(inplace=True, drop=True)

In [331]:
# change text variable to list
import ast
edf_talk['text'] = edf_talk['text'].apply(ast.literal_eval) # parse the stringified lists without eval
edf_talk['text'][0]

[['T:',
  '(Project place value chart to the thousands.)  Show 4 ones as place value disks.  Write the number below it.'],
 ['S:', '(Draw 4 ones disks and write 4 below it.) '],
 ['T:', 'Show 4 tens disks, and write the number below it.'],
 ['S:', '(Draw 4 tens disks and write 4 at the bottom of the tens column.)'],
 ['T:', 'Say the number in unit form.'],
 ['S:', '4 tens 4 ones.'],
 ['T:', 'Say the number in standard form.'],
 ['S:', '44.']]
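
Lists stored in a CSV come back as strings such as `"[['T:', ...], ['S:', ...]]"`. `ast.literal_eval` from the standard library parses these while accepting only Python literals, so arbitrary expressions in the file are rejected rather than executed as they would be with `eval`. A short sketch (`parse_or_none` is a hypothetical helper):

```python
import ast

raw = "[['T:', 'Say the number in unit form.'], ['S:', '4 tens 4 ones.']]"
turns = ast.literal_eval(raw)

def parse_or_none(cell):
    """Parse a stringified literal; return None when it is not a plain literal."""
    try:
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None

# A function call is not a literal, so it is rejected instead of being run.
rejected = parse_or_none("__import__('os').getcwd()")
```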

In [341]:
edf_talk_sentence = edf_talk.explode('text')
edf_talk_sentence.reset_index(inplace=True, drop=True)

In [342]:
edf_talk_sentence['text'][0]

['T:',
 '(Project place value chart to the thousands.)  Show 4 ones as place value disks.  Write the number below it.']

In [343]:
edf_talk_sentence['text'][1]

['S:', '(Draw 4 ones disks and write 4 below it.) ']

In [351]:
edf_talk_sentence['text'][1][0][:2]

'S:'

In [348]:
edf_talk_sentence['text'][30258]

['Ms. Johnson and Ms. Siple were folding report cards to send home to parents.  The ratio of the number of report cards Ms. Johnson folded to the number of report cards Ms. Siple folded is .  At the end of the day, Ms. Johnson and Ms. Siple folded a total of  report cards.  How many did each person fold? ']

In [349]:
edf_talk_sentence['text'][30258][0][:2]

'Ms'

In [378]:
edf_talk_sentence['talk_source'] = NaN
edf_talk_sentence['clean_text'] = NaN

In [377]:
edf_talk_sentence

Unnamed: 0,grade,standard,lesson_part,text,teacher_talk,student_talk,talk_source
0,4,4.NBT.2,Place Value,"[T:, (Project place value chart to the thousan...",(Project place value chart to the thousands.) ...,,
1,4,4.NBT.2,Place Value,"[S:, (Draw 4 ones disks and write 4 below it.) ]",,(Draw 4 ones disks and write 4 below it.),
2,4,4.NBT.2,Place Value,"[T:, Show 4 tens disks, and write the number b...","Show 4 tens disks, and write the number below it.",,
3,4,4.NBT.2,Place Value,"[S:, (Draw 4 tens disks and write 4 at the bot...",,(Draw 4 tens disks and write 4 at the bottom o...,
4,4,4.NBT.2,Place Value,"[T:, Say the number in unit form.]",Say the number in unit form.,,
...,...,...,...,...,...,...,...
30258,6,,Closing,[Ms. Johnson and Ms. Siple were folding report...,,,
30259,6,,Closing,"[Ms. Johnson folded report cards, and Ms. Sip...",,,
30260,6,,Closing,"[At a country concert, the ratio of the number...",,,
30261,6,,Closing,[There are boys at the country concert. ],,,


### Extract text based on teacher or student & remove instructions

In [383]:
talk_source, clean_text = [], []
for curr_text in edf_talk_sentence['text']:
  speaker = curr_text[0][:2]
  source, cleaned = NaN, NaN
  # only turns starting with T: (teacher) or S: (student) that carry an utterance are kept
  if speaker in ('T:', 'S:') and len(curr_text) >= 2 and len(curr_text[1]) >= 2:
    source = 'teacher_text' if speaker == 'T:' else 'student_text'
    cleaned = curr_text[1]
    # remove the first parenthesized substring --> those are instructions for movement
    if '(' in cleaned and ')' in cleaned: # `'(' and ')' in s` only tested for ')'
      cleaned, instruction = extract_paren_substring(cleaned)
  talk_source.append(source)
  clean_text.append(cleaned)
# assign whole columns at once instead of chained indexing like df['col'][i] = value
edf_talk_sentence['talk_source'] = talk_source
edf_talk_sentence['clean_text'] = clean_text

In [387]:
edf_talk_sentence = edf_talk_sentence.drop(columns=['teacher_talk', 'student_talk'])

In [391]:
edf_talk_sentence['talk_source'].unique()

array(['teacher_text', 'student_text', nan], dtype=object)

In [390]:
from google.colab import files  # needed for files.download below
edf_talk_sentence.to_csv(DATA_FOLDER + "math_lang/edf_talk_sentence.csv")
files.download(DATA_FOLDER + 'math_lang/edf_talk_sentence.csv')


In [421]:
edf_talk_sentence

Unnamed: 0,grade,standard,lesson_part,text,talk_source,clean_text
0,4,4.NBT.2,Place Value,"[T:, (Project place value chart to the thousan...",teacher_text,Show 4 ones as place value disks. Write the...
1,4,4.NBT.2,Place Value,"[S:, (Draw 4 ones disks and write 4 below it.) ]",student_text,
2,4,4.NBT.2,Place Value,"[T:, Show 4 tens disks, and write the number b...",teacher_text,"Show 4 tens disks, and write the number below it."
3,4,4.NBT.2,Place Value,"[S:, (Draw 4 tens disks and write 4 at the bot...",student_text,
4,4,4.NBT.2,Place Value,"[T:, Say the number in unit form.]",teacher_text,Say the number in unit form.
...,...,...,...,...,...,...
30258,6,,Closing,[Ms. Johnson and Ms. Siple were folding report...,,
30259,6,,Closing,"[Ms. Johnson folded report cards, and Ms. Sip...",,
30260,6,,Closing,"[At a country concert, the ratio of the number...",,
30261,6,,Closing,[There are boys at the country concert. ],,
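
`extract_paren_substring` removes only the first parenthetical, so a turn with several stage directions keeps the later ones. The sketch below is one possible alternative, not the notebook's method: `label_turn` is a hypothetical helper that strips every parenthesized span with `re.sub` and maps the speaker tag through a small dictionary. Turns that become empty after stripping get no `clean_text`.

```python
import re
import pandas as pd

turns = pd.DataFrame({"text": [
    ["T:", "(Project place value chart.)  Show 4 ones as place value disks."],
    ["S:", "(Draw 4 ones disks and write 4 below it.) "],
    ["S:", "4 tens 4 ones."],
    ["Ms. Johnson and Ms. Siple were folding report cards."],
]})

SPEAKERS = {"T:": "teacher_text", "S:": "student_text"}

def label_turn(turn):
    """Return (talk_source, clean_text) for one [speaker, utterance] list."""
    if len(turn) < 2 or turn[0][:2] not in SPEAKERS:
        return None, None
    # Drop every parenthesized stage direction, then tidy the whitespace.
    spoken = re.sub(r"\([^)]*\)", "", turn[1])
    spoken = re.sub(r"\s+", " ", spoken).strip()
    return SPEAKERS[turn[0][:2]], spoken or None

labels = [label_turn(turn) for turn in turns["text"]]
turns["talk_source"] = [source for source, _ in labels]
turns["clean_text"] = [text for _, text in labels]
```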


## SFDF

In [17]:
sfdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    66 non-null     object
 1   grade         66 non-null     object
 2   lesson        66 non-null     object
 3   objective     66 non-null     object
 4   description   66 non-null     object
 5   teacher_talk  62 non-null     object
 6   student_talk  53 non-null     object
dtypes: object(7)
memory usage: 3.7+ KB


In [18]:
# check unique values grade has
sfdf['grade'].unique()

array(['4', 'unit 4', 'unit 5', 'unit 3', 'unit 2', 'grade 4'],
      dtype=object)

In [19]:
sfdf['grade'] = sfdf['grade'].replace(['grade 4','unit 4'], '4')
sfdf['grade'] = sfdf['grade'].replace(['unit 5'], '5')
sfdf['grade'] = sfdf['grade'].replace(['unit 3'], '3')
sfdf['grade'] = sfdf['grade'].replace(['unit 2'], '2')
# check unique values grade has after replacing
sfdf['grade'].unique()

array(['4', '5', '3', '2'], dtype=object)
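
Since every label above ends in the grade digit, the four `replace` calls can also be expressed as one pattern. A sketch on the same six label values, assuming (as the replacements above do) that `unit N` denotes grade N:

```python
import pandas as pd

grades = pd.Series(["4", "unit 4", "unit 5", "unit 3", "unit 2", "grade 4"])

# Keep only the first run of digits in each label.
normalized = grades.str.extract(r"(\d+)", expand=False)
```

A label without digits, such as `K` for kindergarten, would come out as `NaN` here and would need its own rule.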

In [20]:
sfdf['student_talk'][0]

'Anticipated Student Responses:\nI saw 4 dots in a line going down, then 3 dots in a line going down. 4 + 3 is 7, then 2 more dots in a line going down, 2 + 7 = 9. Then the last dot alone is 1, and 1 + 9 = 10.  4 + 3 + 2 + 1 = 10\nI saw a group of 4 in a diamond on the right. Then I saw leftover was 3 dots on the top left and 3 dots on the bottom left. So I added 3 + 3 + 4 which is 6 + 4, which is 10. 3 + 3 + 4 = 10\nI saw a triangle of three dots on the right, the top left, and the bottom left. So I thought of 3 groups of 3 which is 9, then added the one in the middle which is 10.\n3 + 3 + 3 + 1 = 10 or (3 x 3) + 1 = 10'

In [25]:
# remove all the Question/Prompt: | Ask: | Question: from text strings
for i in range(len(sfdf)):
  text = sfdf.loc[i, 'teacher_talk']
  if isinstance(text, str): # skip missing values; `is not NaN` is an unreliable identity check
    sfdf.loc[i, 'teacher_talk'] = re.sub(r'(Question/Prompt:\s|Ask:\s|Question:\s)', '', text)

In [26]:
sfdf[['teacher_talk']]

Unnamed: 0,teacher_talk
0,How many dots do you see? How do you see them?...
1,Description: Ask students Which amount is grea...
2,How many ___ do you see? How do you see them? ...
3,How many ___ do you see? How do you see them? ...
4,Which one doesn’t belong? Why?
...,...
61,What expression might go with this number line...
62,
63,Where does each number go in this diagram?\nAb...
64,Where does each number go in this diagram?\nWi...
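
The prefix removal can also be done for the whole column in one call. The sketch below uses `Series.str.replace` with `regex=True`; missing values pass through as `NaN`, so no explicit check is needed. The sample strings are made up.

```python
import pandas as pd

talk = pd.Series([
    "Question/Prompt: How many dots do you see?",
    "Ask: Which one doesn’t belong? Why?",
    None,
])

# Same alternatives as the loop above, written with a non-capturing group.
pattern = r"(?:Question/Prompt|Ask|Question):\s"
cleaned = talk.str.replace(pattern, "", regex=True)
```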


## SSDF

In [423]:
ssdf[['text']]

Unnamed: 0,text
0,Count to 100 by ones and by tens.
1,Count forward beginning from a given number wi...
2,Write numbers from 0 to 20. Represent a number...
3,Understand the relationship between numbers an...
4,"Count to answer ""how many?"" questions about as..."
...,...
326,"Draw construct, and describe geometrical figur..."
327,Solve real-life and mathematical problems invo...
328,Understand congruence and similarity using phy...
329,Understand and apply the Pythagorean Theorem.


# Building Elementary Discourse Math Language Corpus

## Merging datasets

In [6]:
imdf_sentence_level = pd.read_csv(DATA_FOLDER + "math_lang/imdf_sentence_level.csv") 
talk_math_sentence_level = pd.read_csv(DATA_FOLDER + "math_lang/talk_math_sentence_level.csv") 
edf_talk_sentence = pd.read_csv(DATA_FOLDER + "math_lang/edf_talk_sentence.csv")
sfdf = pd.read_csv(DATA_FOLDER + "SFUSD/sfusd_mathtalks.csv")
ssdf = pd.read_csv(DATA_FOLDER + "ccss_text.csv") 

In [9]:
math_df1_t = imdf_sentence_level[['grade','teacher_text']].copy() # .copy() avoids SettingWithCopyWarning below
math_df1_s = imdf_sentence_level[['grade','student_text']].copy()

In [27]:
math_df1_t['talk_source'] = 'teacher_text'
math_df1_s['talk_source'] = 'student_text'


In [33]:
math_df1_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12853 entries, 0 to 12852
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   grade         12853 non-null  int64 
 1   student_text  12788 non-null  object
 2   talk_source   12853 non-null  object
dtypes: int64(1), object(2)
memory usage: 301.4+ KB


In [35]:
math_df1_s.columns = ['grade','text','talk_source']
math_df1_t.columns = ['grade','text','talk_source']

In [36]:
# merge student and teacher text together
math_df1 = pd.concat([math_df1_s, math_df1_t], axis = 0)
math_df1

Unnamed: 0,grade,text,talk_source
0,1,One person can count first and then the next c...,student_text
1,1,We put each object in a pile after we counted ...,student_text
2,1,They drew twelve circles. They wrote the numbe...,student_text
3,1,We talked to each other and to the teacher. We...,student_text
4,1,One person can count first and then the next c...,student_text
...,...,...,...
12848,5,What were the most important things about your...,teacher_text
12849,5,What did the writer of this activity have to p...,teacher_text
12850,5,What were the most important things about your...,teacher_text
12851,5,What did the writer of this activity have to p...,teacher_text
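
Splitting the teacher and student columns, labeling each, renaming, and concatenating is a wide-to-long reshape, which `DataFrame.melt` can do in one step. A sketch on toy data (the rows and the final `dropna` are assumptions; the cells above keep missing text at this stage):

```python
import pandas as pd

wide = pd.DataFrame({
    "grade": [1, 1],
    "teacher_text": ["How did they show their count?", "What is the area?"],
    "student_text": ["They drew twelve circles.", None],
})

# One output row per (original row, source column) pair.
long = wide.melt(
    id_vars="grade",
    value_vars=["student_text", "teacher_text"],
    var_name="talk_source",
    value_name="text",
)
# Optional: drop turns that have no text at all.
long = long.dropna(subset=["text"]).reset_index(drop=True)
```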


In [39]:
# check grade level
math_df1['grade'].unique()

array([1, 2, 3, 4, 5])

In [38]:
talk_math_sentence_level['talk_source'] = 'teacher_text'
talk_math_sentence_level.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1476 entries, 0 to 1475
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   1476 non-null   int64 
 1   grade        1476 non-null   object
 2   lesson       1476 non-null   object
 3   standard     1364 non-null   object
 4   text         1476 non-null   object
 5   talk_source  1476 non-null   object
dtypes: int64(1), object(5)
memory usage: 69.3+ KB


In [40]:
# check grade level
talk_math_sentence_level['grade'].unique()

array(['K', '1', '2', '3', '4', '5'], dtype=object)

In [41]:
talk_math_sentence_level['text']

0                         What kinds of toys do you see? 
1                   What can you tell me about the toys? 
2                               How many toys are round? 
3                          How many have straight sides? 
4       Do you think there are more round toys or toys...
                              ...                        
1471            Can paint brushes be measured in meters? 
1472                                            Explain. 
1473     If all the pencils in the picture were lined ...
1474                     What is this measure in meters? 
1475                                           Explain.  
Name: text, Length: 1476, dtype: object

In [44]:
# combine with previous data
math_df2 = pd.concat([math_df1, 
                      talk_math_sentence_level[['grade','text','talk_source']]],
                     axis = 0)

In [45]:
math_df2

Unnamed: 0,grade,text,talk_source
0,1,One person can count first and then the next c...,student_text
1,1,We put each object in a pile after we counted ...,student_text
2,1,They drew twelve circles. They wrote the numbe...,student_text
3,1,We talked to each other and to the teacher. We...,student_text
4,1,One person can count first and then the next c...,student_text
...,...,...,...
1471,4,Can paint brushes be measured in meters?,teacher_text
1472,4,Explain.,teacher_text
1473,5,If all the pencils in the picture were lined ...,teacher_text
1474,5,What is this measure in meters?,teacher_text


In [43]:
# check grade level
edf_talk_sentence['grade'].unique()
# don't include grade 6 for now
edf_talk_sentence = edf_talk_sentence[edf_talk_sentence['grade'] != 6]
# check grade level again
edf_talk_sentence['grade'].unique()

array([4, 5, 3])

In [49]:
edf_talk_sentence.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30075 entries, 0 to 30074
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   30075 non-null  int64 
 1   grade        30075 non-null  int64 
 2   standard     30075 non-null  object
 3   lesson_part  30075 non-null  object
 4   text         30075 non-null  object
 5   talk_source  23045 non-null  object
 6   clean_text   20503 non-null  object
dtypes: int64(2), object(5)
memory usage: 1.8+ MB


In [62]:
edf_sub = edf_talk_sentence[['grade','clean_text','talk_source']].dropna()

In [63]:
# create a new column storing the length of text
edf_sub['txt_length'] = edf_sub.clean_text.str.len()

# keep only rows where len(clean_text) > 5
edf_sub = edf_sub[edf_sub['txt_length'] > 5]

# reset index
edf_sub.reset_index(inplace=True, drop=True)

In [68]:
edf_sub.columns = ['grade','text','talk_source','txt_length']
edf_sub.head()

Unnamed: 0,grade,text,talk_source,txt_length
0,4,Show 4 ones as place value disks. Write the...,teacher_text,63
1,4,"Show 4 tens disks, and write the number below it.",teacher_text,49
2,4,Say the number in unit form.,teacher_text,28
3,4,4 tens 4 ones.,student_text,14
4,4,Say the number in standard form.,teacher_text,32


In [None]:
# merge it with math_df2
math_df = pd.concat([math_df2,
                     edf_sub[['grade','text','talk_source']]],
                    axis = 0,
                    ignore_index = True)
math_df

In [72]:
# check string length
math_df['txt_length'] = math_df.text.str.len()

In [74]:
math_df[math_df['txt_length']< 10] 
# these short rows look like noise and can be removed :)

Unnamed: 0,grade,text,talk_source,txt_length
159,1,\(6 + 2\,student_text,8.0
167,1,\(6 + 2\,student_text,8.0
175,1,\(6 + 2\,student_text,8.0
183,1,\(6 + 2\,student_text,8.0
292,1,4,student_text,1.0
...,...,...,...,...
46239,3,12 tiles.,student_text,9.0
46241,3,6 twos.,student_text,7.0
46253,3,7 threes.,student_text,9.0
46264,3,4 inches!,student_text,9.0


In [78]:
math_df[math_df['txt_length'] < 15]

Unnamed: 0,grade,text,talk_source,txt_length
216,1,\(4 + 2 = 6\,student_text,12.0
220,1,\(4 + 2 = 6\,student_text,12.0
224,1,\(4 + 2 = 6\,student_text,12.0
254,1,\(3 + 7 = 10\,student_text,13.0
256,1,\(3 + 4 = 7\,student_text,12.0
...,...,...,...,...
46214,3,Square inches!,student_text,14.0
46231,3,2 × 4 = 1 × 8.,student_text,14.0
46281,3,4 × 7 = 28.,student_text,11.0
46290,3,6 × 8 = 48.,student_text,11.0


In [79]:
math_df3 = math_df[math_df['txt_length'] > 15] 

In [80]:
math_df3[math_df3['txt_length'] < 20]

Unnamed: 0,grade,text,talk_source,txt_length
417,1,False. \(4 + 2 = 6\,student_text,19.0
422,1,False. \(4 + 2 = 6\,student_text,19.0
427,1,False. \(4 + 2 = 6\,student_text,19.0
519,1,If \(10 + 4 = 14\,student_text,17.0
526,1,If \(10 + 4 = 14\,student_text,17.0
...,...,...,...,...
46112,3,36 square units!,student_text,16.0
46141,3,"5, 7, 8, and 10.",student_text,16.0
46220,3,What is the area?,teacher_text,17.0
46221,3,8 square inches!,student_text,16.0


In [82]:
from google.colab import files
math_df3.to_csv("math_df3.csv")
files.download('math_df3.csv')


## Remove useless stuff (start w/ a subset first)

Things to remove:

- incomplete sentences/expressions (maybe)
  - e.g. cross each other

- duplicates (18238 duplicates)

- non-math 
  - e.g. how do you know? what did you do?
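
The duplicate and non-math items in the list above could be handled roughly as follows. This is only a sketch: `GENERIC_PROMPTS` is a hypothetical hand-made set, and comparing on a lowercased, stripped key is an assumption about what should count as a duplicate.

```python
import pandas as pd

corpus = pd.DataFrame({"text": [
    "How do you know?",
    "What is the area of the rectangle?",
    "what is the area of the rectangle? ",
    "What did you do?",
]})

# Hypothetical list of generic, non-mathematical prompts.
GENERIC_PROMPTS = {"how do you know?", "what did you do?"}

# Compare on a normalized key so case and stray spaces do not hide repeats.
key = corpus["text"].str.lower().str.strip()
is_generic = key.isin(GENERIC_PROMPTS)
is_repeat = key.duplicated(keep="first")
filtered = corpus[~is_generic & ~is_repeat].reset_index(drop=True)
```

Keeping the masks separate makes it easy to count how many rows each rule removes before committing to it.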

In [5]:
math_test = pd.read_csv(DATA_FOLDER + 'math_lang/math_df3.csv')

In [6]:
math_test = math_test[math_test['txt_length'] >= 50]

In [8]:
math_test[['text']]

Unnamed: 0,text
0,"It’s similar. The fractions are the same, but..."
1,I started by imagining the mat without the pic...
2,The first step is just subtraction. We can do...
3,It does make sense. It couldn’t be put betwee...
4,The sides of the rectangles are all different ...
...,...
15991,How can you find the total area of the rectang...
15992,What number sentence can be used to find the a...
15993,"At your table, place tiles to make the known s..."
15994,Use your tiles to make another side 7 inches l...


In [45]:
type(math_test['text'][0])

str

## Text pre-processing at sentence level

- ✅ lowercase text
- ✅ line breaks removal
- ✅ expand contractions
- 🤔 replace special math symbols with oral expressions
- 🤔 NER tagging & normalization
- ✅ sentence tokenization

##### Install Dependencies

In [None]:
!pip install clean-text
!pip install unidecode
!pip install num2words
!pip install contractions

In [None]:
from num2words import num2words
from cleantext import clean 
import contractions # for expanding contractions 

##### `sent_tokenizer(): str -> list`: Sentence Tokenization

- output: a list of decomposed sentences

In [131]:
from nltk.tokenize.punkt import PunktSentenceTokenizer

_punkt = PunktSentenceTokenizer()  # build the tokenizer once and reuse it

def sent_tokenizer(s):
    # split a string into a list of sentences
    return _punkt.tokenize(s)

sent_tokenizer(math_test['text'][0])

['It’s similar.',
 'The fractions are the same, but when you draw this one, you have to start with 1 half and then chop that into fourths.',
 'The model for this problem looks like what we drew for Benjamin’s garden, except it’s been turned on its side.',
 'When we wrote the multiplication sentence, the factors were switched around.',
 'This time, we’re finding 3 fourths of a half, not a half of 3 fourths.',
 'If this were another garden, less of the garden is planted in vegetables overall.',
 'Last time, it was \n3 fourths of the garden; this time, it would be only half.',
 'The fraction of the whole garden that is carrots is the same, but now, there is only 1 eighth of the garden planted in other vegetables.',
 'Last time, 3 eighths of the garden would have had other vegetables.']

##### `number_to_words(): str -> str`: Convert a number to words (helper function used in `basic_clean()`)

In [39]:
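def number_to_words(num):
    # num: numeric string, possibly with thousands separators (e.g. '1,200')
    try:
        return num2words(re.sub(",", "", num))  # strip commas before converting
    except Exception:  # not a parseable number: return the input unchanged
        return num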

print(number_to_words('14.99'))
print(number_to_words('588'))
print(number_to_words('3'))

fourteen point nine nine
five hundred and eighty-eight
three


##### `basic_clean(): str -> str`: perform first-step sentence-level cleaning:
- Lowercase text
- Remove urls / emails / phone numbers
- Remove line breaks (e.g. \n or \t)
- ⭕ Replace numbers with word expressions
- 🤔 Keep OR remove punctuation?
- 🤔 Dollar signs removal?
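
The still-open number replacement step could be done with a regex callback. The following is a sketch under assumptions: `NUMBER_RE` and `replace_numbers` are hypothetical names, and the converter is passed in as an argument so that `number_to_words` from this notebook could be plugged in. The demo uses a small stand-in dictionary instead of `num2words` to stay self-contained.

```python
import re

# Integers and decimals, with optional thousands separators (e.g. 1,200 or 14.99).
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")

def replace_numbers(text, to_words):
    """Replace every number in `text` with `to_words(number_string)`."""
    return NUMBER_RE.sub(lambda m: to_words(m.group()), text)

# Stand-in converter for the demo; in the notebook this would be number_to_words.
demo_words = {"3": "three", "300": "three hundred"}
print(replace_numbers("it is 3 fourths of 300.", lambda n: demo_words.get(n, n)))
```

A sentence-final period is left alone because the decimal part only matches when digits follow the dot.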

In [148]:
basic_clean = lambda s: clean(s,
    fix_unicode=True,               # fix various unicode errors
    to_ascii=True,                  # transliterate to closest ASCII representation
    lower=True,                     # lowercase text
    no_line_breaks=True,           # fully strip line breaks as opposed to only normalizing them
    no_urls=True,                  # replace all URLs with a special token
    no_emails=True,                # replace all email addresses with a special token
    no_phone_numbers=True,         # replace all phone numbers with a special token
    no_numbers=False,               # keep numbers as-is (True would replace them)
    replace_with_number= lambda m: number_to_words(m.group()), # number -> words; not applied while no_numbers=False

    no_digits=False,                # replace all digits with a special token
    no_punct=False,                 # DO NOT remove punctuations
    #replace_with_punct="",          # instead of removing punctuations you may replace them
    
    no_currency_symbols=True,      # replace all currency symbols with a special token
    # what do I do with dollar signs right now?
    replace_with_currency_symbol="", 
    lang="en"                       
)

In [47]:
s = basic_clean('Last time, it’s \n3 \t fourths of the garden; this time, it’d be only half $300.')
s

"last time, it's 3 fourths of the garden; this time, it'd be only half 300."

##### `expand_contract(): str -> str` -- Expand contractions like it's, we'd, etc.

In [48]:
expand_contract = lambda s: ' '.join([ contractions.fix(word) for word in s.split()])
expand_contract(s)

'last time, it is 3 fourths of the garden; this time, it would be only half 300.'

##### `normalize_math_symbol(): str -> str`: Normalize math symbols/expressions

In [149]:
def normalize_math_symbol(s):
  if r'\frac' in s:             # use 1/2 to represent \frac{1}{2} or convert to words
    s = re.sub(r'\\frac\s?\{','', s.replace('}{','/'))
  if r'\div' in s or '÷' in s:  # Q: change to divided by??
    s = re.sub(r'\\div|÷','divided by', s)
  if r'\times' in s or '×' in s: # Q: change it to times or multiply by??
    s = re.sub(r'\\times|×','times',s)
  ## DO WE WANT TO KEEP symbols of + - = > < ?
  # if '+' in s:
  #   s = s.replace('+','plus')
  # if r'-' in s:
  #   s = s.replace('-','minus')
  return re.sub(r'[\\\(\)\{\}\[\]]+','',s) # remove leftover symbols like \, (,),{,},[,]

In [129]:
exp = 'Because there are as many groups of \\(\\frac {1}{8}\\. 1 + 2 = 3. 4 ÷ 3. 1 × 2. [] '

In [130]:
normalize_math_symbol(exp)

'Because there are as many groups of 1/8. 1 plus 2 = 3. 4 divided by 3. 1 times 2.  '
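
The `\frac` branch above replaces every `}{` in the string, including ones that do not belong to a fraction. A possible alternative, sketched here with the hypothetical names `FRAC_RE` and `normalize_fractions`, captures numerator and denominator explicitly. It assumes simple, non-nested fractions.

```python
import re

# \frac{a}{b} with optional spaces; numerator and denominator must not contain braces.
FRAC_RE = re.compile(r"\\frac\s*\{([^{}]*)\}\s*\{([^{}]*)\}")

def normalize_fractions(s):
    """Rewrite \\frac{a}{b} as a/b and leave everything else untouched."""
    return FRAC_RE.sub(r"\1/\2", s)

print(normalize_fractions(r"groups of \(\frac {1}{8}\) in \(\frac{3}{4}\)"))
```

The remaining `\(` and `\)` delimiters would still be removed by the final symbol-stripping step of `normalize_math_symbol`.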

##### `ner_normalization(): str -> str`: (sentence level now) Named Entity Recognition Tagging and Normalization
**Q: Do I perform this step at sentence level or paragraph level / word level?**

###### Spacy -- Fast & pretty robust

In [157]:
import spacy



In [158]:
from spacy import displacy
NER = spacy.load("en_core_web_sm")

In [200]:
def ner_normalization(s):
  text = NER(s) # run the spaCy pipeline (including NER) on the input text
  processed_tokens = []
  for token in text:
    if token.ent_type_ in ['ORG','PRODUCT','GPE','LOC']:
      processed_tokens.append('<entity>') # Replace entities with arbitrary token
    else:
      processed_tokens.append(token.text)
  return ' '.join(processed_tokens) # text with entity normalization

In [201]:
ner_normalization(s1)

'  The model for this problem looks like what we drew for <entity> <entity> garden , except it ’s been turned on its side .    When we wrote the multiplication sentence , the factors were switched around .    This time , we ’re finding 3 fourths of a half , not a half of 3 fourths .'

In [174]:
s1 = ' The model for this problem looks like what we drew for Benjamin’s garden, except it’s been turned on its side.   When we wrote the multiplication sentence, the factors were switched around.   This time, we’re finding 3 fourths of a half, not a half of 3 fourths. '
s1

' The model for this problem looks like what we drew for Benjamin’s garden, except it’s been turned on its side.   When we wrote the multiplication sentence, the factors were switched around.   This time, we’re finding 3 fourths of a half, not a half of 3 fourths. '

In [175]:
text1 = NER(s1)

In [176]:
text1.ents

(Benjamin’s, 3 fourths, a half of 3 fourths)

In [177]:
for word in text1.ents:
  print(word.text, word.label_)

Benjamin’s ORG
3 fourths CARDINAL
a half of 3 fourths CARDINAL


In [193]:
text1[12].ent_type_

'ORG'
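
The token-by-token loop in `ner_normalization()` emits one `<entity>` per token (two for `Benjamin’s`) and re-joins everything with spaces, which changes the spacing around punctuation. A sketch of a span-based alternative follows. `replace_entity_spans` is a hypothetical helper, and the spans are assumed to come in the shape `(ent.start_char, ent.end_char, ent.label_)` from spaCy's `doc.ents`; here they are built by hand so the block runs without spaCy.

```python
def replace_entity_spans(text, spans, keep_labels=("ORG", "PRODUCT", "GPE", "LOC"),
                         token="<entity>"):
    """Replace each (start, end, label) character span whose label is in keep_labels."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in keep_labels or start < cursor:
            continue  # skip other labels and overlapping spans
        out.append(text[cursor:start])  # untouched text before the entity
        out.append(token)               # one token per entity, not per word
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

text = "what we drew for Benjamin’s garden"
# Hand-built span standing in for one spaCy entity labelled ORG.
start = text.index("Benjamin’s")
spans = [(start, start + len("Benjamin’s"), "ORG")]
print(replace_entity_spans(text, spans))
```

Because the original string is sliced rather than re-tokenized, whitespace and punctuation outside the entities stay exactly as they were.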

###### Using flair -- not as good as spacy in our example

In [None]:
!pip install flair

In [165]:
from flair.data import Sentence
from flair.models import SequenceTagger

In [166]:
# load pre-trained NER model
flair_NER = SequenceTagger.load('ner')



2023-02-06 06:34:28,800 loading file /root/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
2023-02-06 06:34:34,617 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


In [180]:
# process text
flair_txt = Sentence(s1)

In [181]:
flair_txt

Sentence: "The model for this problem looks like what we drew for Benjamin ’s garden , except it ’s been turned on its side . When we wrote the multiplication sentence , the factors were switched around . This time , we ’re finding 3 fourths of a half , not a half of 3 fourths ."

In [182]:
flair_NER.predict(flair_txt)

In [183]:
flair_txt.get_spans('ner')

[Span[11:12]: "Benjamin" → PER (0.7074)]

#### `text_cell_clean(): str -> list`: Perform complete cleaning pipeline

NER normalization not included yet

In [150]:
def text_cell_clean(text):
  # 1. basic cleaning
  clean_txt = basic_clean(text)
  # 2. expand contractions 
  expand_txt = expand_contract(clean_txt)
  # 3. normalize math symbols
  nor_txt = normalize_math_symbol(expand_txt)
  # 4. sentence tokenization
  sentences = sent_tokenizer(nor_txt)
  return sentences

In [133]:
math_test['text'][0]

'It’s similar.  The fractions are the same, but when you draw this one, you have to start with 1 half and then chop that into fourths.   The model for this problem looks like what we drew for Benjamin’s garden, except it’s been turned on its side.   When we wrote the multiplication sentence, the factors were switched around.   This time, we’re finding 3 fourths of a half, not a half of 3 fourths.   If this were another garden, less of the garden is planted in vegetables overall.  Last time, it was \n3 fourths of the garden; this time, it would be only half.  The fraction of the whole garden that is carrots is the same, but now, there is only 1 eighth of the garden planted in other vegetables.  Last time, 3 eighths of the garden would have had other vegetables.'

In [134]:
text_cell_clean(math_test['text'][0])

['it is similar.',
 'the fractions are the same, but when you draw this one, you have to start with 1 half and then chop that into fourths.',
 "the model for this problem looks like what we drew for benjamin's garden, except it is been turned on its side.",
 'when we wrote the multiplication sentence, the factors were switched around.',
 'this time, we are finding 3 fourths of a half, not a half of 3 fourths.',
 'if this were another garden, less of the garden is planted in vegetables overall.',
 'last time, it was 3 fourths of the garden; this time, it would be only half.',
 'the fraction of the whole garden that is carrots is the same, but now, there is only 1 eighth of the garden planted in other vegetables.',
 'last time, 3 eighths of the garden would have had other vegetables.']

#### Apply cleaning pipeline to every text entry

In [152]:
df = math_test[['grade','text']]

In [153]:
math_test[['text']].head(10)

Unnamed: 0,text
0,"It’s similar. The fractions are the same, but..."
1,I started by imagining the mat without the pic...
2,The first step is just subtraction. We can do...
3,It does make sense. It couldn’t be put betwee...
4,The sides of the rectangles are all different ...
5,"Well, she can make 7 dresses. I guess she’ll ..."
6,"Both addends are mixed numbers, so Solution B ..."
7,It’s true. I just look at the other denominat...
8,What’s the biggest prism I can do? We can do...
9,I hear some students noticing the pattern that...


In [154]:
# apply cleaning pipeline to every text entry
# (take an explicit copy first so pandas does not raise SettingWithCopyWarning)
df = math_test[['grade','text']].copy()
df['text'] = df['text'].apply(text_cell_clean)


In [155]:
df[['text']].explode('text').head(1000)

Unnamed: 0,text
0,it is similar.
0,"the fractions are the same, but when you draw ..."
0,the model for this problem looks like what we ...
0,"when we wrote the multiplication sentence, the..."
0,"this time, we are finding 3 fourths of a half,..."
...,...
181,3 ones were changed to 3 tens.
182,i drew the area model showing 3 tenths and 4 h...
182,"then, i decomposed the area into hundredths to..."
182,that meant that i had 30 hundredths and 4 hund...


## Feature Engineering process -- word level (prep for N-gram)

[Primary School Math Word Problems Corpus (Miao, Liang, & Su, 2020)](https://arxiv.org/pdf/2106.15772.pdf)

1. Sentence-level Normalization:
  - 🤔 stopword removal (maybe not in our case)
  - named entity normalization
  - 🤔 quantity entity normalization? (e.g. same quantity-values will not be considered different)
2. Word tokenization & POS tagging via Stanford `CoreNLP`
3. Lemmatize each token via `NLTK`

🤔 Use the *lexicon usage diversity* metric, defined in terms of [BLEU](https://aclanthology.org/P02-1040.pdf), to measure corpus diversity.
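
As a rough illustration of the word-level preparation for N-grams, the sketch below covers only quantity normalization, tokenization, and n-gram counting. It uses a plain regex tokenizer as a stand-in, not CoreNLP or NLTK, and does no POS tagging or lemmatization. `TOKEN_RE`, `tokenize`, `ngrams`, and the `<num>` placeholder are hypothetical choices.

```python
import re
from collections import Counter

# Numbers (integer or decimal) or lowercase words, optionally with an apostrophe part.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[a-z]+(?:'[a-z]+)?")

def tokenize(sentence, number_token="<num>"):
    """Lowercase, split into word/number tokens, and map every number to one token."""
    tokens = TOKEN_RE.findall(sentence.lower())
    return [number_token if tok[0].isdigit() else tok for tok in tokens]

def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentences = ["3 ones were changed to 3 tens.", "4 ones were changed to 4 tens."]
counts = Counter(g for s in sentences for g in ngrams(tokenize(s), 2))
print(counts.most_common(3))
```

With quantities mapped to a single token, the two example sentences produce identical bigrams, which is the effect the quantity entity normalization step is meant to have.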