# Format 3000, 5000, and 5000-exclusive data as word lists
Format the 3000, 5000, and 5000_exclusive datasets in the following formats:

* all data, text, British: word, type, definition, example, phonetics, CEFR
* all data, text, US: word, type, definition, example, phonetics, CEFR
* all data + pronunciation: same as above with clickable HTML
* 2 column: word, type, definition

All of the above are also grouped by CEFR level.

In [25]:
import pandas as pd
import os 

In [26]:
DATASET = 'oxford_3000'
#DATASET = 'oxford_5000'
#DATASET = 'oxford_5000_exclusive'
df = pd.read_pickle(f"./data/{DATASET}.pkl")
df.head()

Unnamed: 0,word,type,cefr,phon_br,phon_n_am,definition,example,uk,us
0,a,indefinite article,a1,/ə/,/ə/,used before countable or singular nouns referr...,a man/horse/unit,a_uk.mp3,a_us.mp3
1,abandon,verb,b2,/əˈbændən/,/əˈbændən/,"to leave somebody, especially somebody you are...","abandon somebody, The baby had been abandoned ...",abandon_uk.mp3,abandon_us.mp3
2,ability,noun,a2,/əˈbɪləti/,/əˈbɪləti/,the fact that somebody/something is able to do...,People with the disease may lose their ability...,ability_uk.mp3,ability_us.mp3
3,able,adjective,a2,/ˈeɪbl/,/ˈeɪbl/,"to have the skill, intelligence, opportunity, ...",You must be able to speak French for this job.,able_uk.mp3,able_us.mp3
4,about,adverb,a1,/əˈbaʊt/,/əˈbaʊt/,a little more or less than; a little before or...,It costs about $10.,about_uk.mp3,about_us.mp3


## HTML+PDF all columns alphabetical

In [27]:
# Complete to HTML
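# .copy() makes an independent frame, so the assignment below does not trigger SettingWithCopyWarning
data = df[["word", "type", "cefr", "phon_br", "phon_n_am", "definition", "example"]].copy()
data['cefr'] = data['cefr'].map(lambda x: x.strip().upper())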

# Same column labels as the grouped-by-CEFR tables below
data = data.rename(columns={
    'phon_br': 'phonetics (UK)',
    'phon_n_am': 'phonetics (US)',
})

style = data.style.format(
    escape="html",
    )
style = style.hide(axis='index')

html = style.to_html()
filename = DATASET + '_alphabetical'
with open(f'output/{filename}.html', 'w', encoding='utf-8') as f:
    f.write(html)

cmd = f'pandoc -f html -t pdf output/{filename}.html -t html5 -o output/{filename}.pdf --metadata pagetitle="{filename}" -V margin-top=2 -V margin-bottom=2 -V margin-left=2 -V margin-right=2 -c format/table.css --pdf-engine-opt=--enable-local-file-access'
os.system(cmd)


Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                          


0
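
The pandoc call above is assembled as one shell string. A minimal sketch of the same invocation as an argument list follows, so that file names containing spaces or quotes need no shell quoting. The helper names `build_pandoc_cmd` and `run_pandoc` are hypothetical, the flags are taken from the cell above, and only `-t html5` is kept on the assumption that it is the writer pandoc acts on when both `-t` flags are given.

```python
import subprocess


def build_pandoc_cmd(filename, out_dir="output", css="format/table.css"):
    """Return the pandoc invocation as a list of arguments (hypothetical helper)."""
    cmd = [
        "pandoc",
        "-f", "html",
        "-t", "html5",
        "-o", f"{out_dir}/{filename}.pdf",
        "--metadata", f"pagetitle={filename}",
    ]
    # Same 2-unit margin on every side, as in the notebook cell.
    for side in ("top", "bottom", "left", "right"):
        cmd += ["-V", f"margin-{side}=2"]
    cmd += [
        "-c", css,
        "--pdf-engine-opt=--enable-local-file-access",
        f"{out_dir}/{filename}.html",
    ]
    return cmd


def run_pandoc(filename):
    """Run pandoc and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_pandoc_cmd(filename), check=True)
```

Calling `run_pandoc(DATASET + '_alphabetical')` would then replace the `os.system(cmd)` line, and a failed conversion raises an exception instead of returning a non-zero number that is easy to overlook.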

## HTML+PDF all columns grouped by CEFR

In [28]:
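# Work on a copy so that normalizing the CEFR labels does not modify df in place
data = df.copy()
data['cefr'] = data['cefr'].str.strip().str.upper()
cefrs = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
# One DataFrame per CEFR level, in the order given by cefrs
data_by_cefr = [data[data['cefr'] == c] for c in cefrs]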

In [29]:
data_by_cefr[1].head()

Unnamed: 0,word,type,cefr,phon_br,phon_n_am,definition,example,uk,us
2,ability,noun,A2,/əˈbɪləti/,/əˈbɪləti/,the fact that somebody/something is able to do...,People with the disease may lose their ability...,ability_uk.mp3,ability_us.mp3
3,able,adjective,A2,/ˈeɪbl/,/ˈeɪbl/,"to have the skill, intelligence, opportunity, ...",You must be able to speak French for this job.,able_uk.mp3,able_us.mp3
8,abroad,adverb,A2,/əˈbrɔːd/,/əˈbrɔːd/,in or to a foreign country,to go/travel/live/study abroad,abroad_uk.mp3,abroad_us.mp3
13,accept,verb,A2,/əkˈsept/,/əkˈsept/,to take willingly something that is offered; t...,He asked me to marry him and I accepted.,accept_uk.mp3,accept_us.mp3
17,accident,noun,A2,/ˈæksɪdənt/,/ˈæksɪdənt/,"an unpleasant event, especially in a vehicle, ...",a car/road/traffic accident,accident_uk.mp3,accident_us.mp3


In [30]:
# Complete to HTML
html_out = ''
for data in data_by_cefr:
    if data.empty:
        continue
    data = data[['word', 'type', 'phon_br', 'phon_n_am', 'definition', 'example', 'cefr']]
    cefr = data['cefr'].iloc[0]
    html_out += f'<h2>{cefr}</h2>'
    data = data.drop(['cefr'], axis=1)
    data = data.rename(columns={
        'word': f'word ({cefr})',
        'phon_br': 'phonetics (UK)',
        'phon_n_am': 'phonetics (US)',
    })

    style = data.style.format(
        escape="html",
        )
    style = style.hide(axis='index')
    html_out += style.to_html()


filename = DATASET+'_by_cefr'
with open(f'output/{filename}.html', 'w', encoding='utf-8') as f:
    f.write(html_out)

# to pdf (os is already imported in the first cell)
cmd = f"""pandoc -f html -t pdf output/{filename}.html -t html5 -o output/{filename}.pdf --metadata pagetitle="{filename}" -V margin-top=2 -V margin-bottom=2 -V margin-left=2 -V margin-right=2 -c format/table.css --pdf-engine-opt=--enable-local-file-access --title '{filename}'"""
os.system(cmd)

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                          


0
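
The grouping and rendering steps above can be sketched as one function, shown here on a toy frame. This is only an illustration under stated assumptions: `render_by_cefr` is a hypothetical name, and `DataFrame.to_html(index=False)` is swapped in for the `Styler` used in the notebook (it also escapes HTML by default), so the markup differs from what `style.to_html()` produces.

```python
import pandas as pd

CEFR_LEVELS = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']


def render_by_cefr(df, levels=CEFR_LEVELS):
    """Return an <h2> heading plus one HTML table per non-empty CEFR level."""
    data = df.copy()  # leave the caller's frame untouched
    data['cefr'] = data['cefr'].str.strip().str.upper()
    parts = []
    for level in levels:
        group = data[data['cefr'] == level].drop(columns='cefr')
        if group.empty:
            continue  # skip levels that have no words in this dataset
        parts.append(f'<h2>{level}</h2>')
        parts.append(group.to_html(index=False))
    return '\n'.join(parts)


# Toy data with untidy CEFR labels, standing in for the real pickle.
toy = pd.DataFrame({
    'word': ['abandon', 'a', 'ability'],
    'type': ['verb', 'indefinite article', 'noun'],
    'cefr': ['b2', ' a1', 'a2 '],
})
html_out = render_by_cefr(toy)
```

Iterating over the fixed level list, rather than over the values found in the data, keeps the sections in A1 to C2 order regardless of row order.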

## HTML+PDF all columns grouped by CEFR, shuffled

In [31]:
# Complete to HTML
html_out = ''
for data in data_by_cefr:
    if data.empty:
        continue
    data = data[['word', 'type', 'phon_br', 'phon_n_am', 'definition', 'example', 'cefr']]
    cefr = data['cefr'].iloc[0]
    html_out += f'<h2>{cefr}</h2>'
    data = data.drop(['cefr'], axis=1)
    data = data.rename(columns={
        'word': f'word ({cefr})',
        'phon_br': 'phonetics (UK)',
        'phon_n_am': 'phonetics (US)',
    })

    data = data.sample(frac=1)

    style = data.style.format(
        escape="html",
        )
    style = style.hide(axis='index')
    html_out += style.to_html()


filename = DATASET+'_by_cefr_shuffle'
with open(f'output/{filename}.html', 'w', encoding='utf-8') as f:
    f.write(html_out)

# to pdf (os is already imported in the first cell)
cmd = f"""pandoc -f html -t pdf output/{filename}.html -t html5 -o output/{filename}.pdf --metadata pagetitle="{filename}" -V margin-top=2 -V margin-bottom=2 -V margin-left=2 -V margin-right=2 -c format/table.css --pdf-engine-opt=--enable-local-file-access --title '{filename}'"""
os.system(cmd)

Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                          


0
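
`data.sample(frac=1)` returns the rows in a new random order on every run, so each rebuild of the shuffled lists differs. If a repeatable order is wanted, `sample` accepts a `random_state` seed. A minimal sketch follows; the helper name `shuffle_rows` and the seed value are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def shuffle_rows(data, seed=None):
    """Return all rows of data in random order; a fixed seed repeats the order."""
    return data.sample(frac=1, random_state=seed).reset_index(drop=True)


words = pd.DataFrame({'word': ['ability', 'able', 'abroad', 'accept', 'accident']})
first = shuffle_rows(words, seed=42)
second = shuffle_rows(words, seed=42)  # same seed, same order as first
```

Passing `seed=None` keeps the notebook's current behaviour of a fresh order on every run.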

## 2-column LaTeX: word, type and definition, alphabetical

In [32]:
import re
# 2 Column word + definition
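# .copy() makes an independent frame, so the assignment below does not trigger SettingWithCopyWarning
data = df[["word", "definition", "type"]].copy()
# Fold the part of speech into the word column, e.g. "abandon (verb)"
data["word"] = data.apply(lambda row: f"{row.word.strip()} ({row.type.strip()})", axis=1)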
data = data[["word", "definition"]]

style = data.style.format(
    escape="latex",
    )
style = style.hide(axis='index')
style = style.hide(axis='columns')

column_format = 'p{1.2in}p{2.3in}p{1.2in}p{2.3in}'
latex = style.to_latex(
    environment='supertabular',
    encoding='utf8x',
    column_format=column_format
)

# Fix supertabular and add \textit to type
def fix_line(line):
    if re.match(r"^\\begin{supertabular}", line):
        # Append the column format to \begin{supertabular}
        return '\\begin{supertabular}'+'{'+column_format+'}'
    if re.match(r"^\\.*{tabular}", line):
        # Remove {tabular}
        return ''
    if re.match(r"^\w+\s.*\(\w+\)", line):
        return re.sub(r"(^\w+\s.*)(\(\w+\))", r"\1\\textit{\2}", line)
    return line

latex_lines = latex.splitlines()
latex = '\n'.join((map(fix_line, latex_lines)))

filename = DATASET + '_table_alphabetical'
with open(f'./build/{filename}.tex', 'w', encoding='utf-8') as f:
    f.write(latex)

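
The `\(\w+\)` pattern in `fix_line` only matches one-word types, so a row such as `a (indefinite article)` is not italicized, and the greedy `.*` before it can select a parenthetical inside the definition instead. A minimal sketch of a narrower pattern is given below. It assumes each row is rendered as `word (type) & definition \\`, as the cell above produces; `italicize_type` is a hypothetical name.

```python
import re

# The parenthetical that closes the first cell, i.e. the one directly before the first "&".
_TYPE_BEFORE_FIRST_AMP = re.compile(r"^([^&]*?)\s*(\([^()&]*\))(\s*&)")


def italicize_type(line):
    """Wrap the "(type)" at the end of the first cell in \\textit{...}.

    Lines without such a parenthetical before the first "&" are returned
    unchanged, including rows whose word contains an escaped "\\&".
    """
    return _TYPE_BEFORE_FIRST_AMP.sub(r"\1 \\textit{\2}\3", line, count=1)


row = r"a (indefinite article) & used before nouns (informal) \\"
fixed = italicize_type(row)
```

Here `fixed` italicizes `(indefinite article)` and leaves `(informal)` in the definition untouched.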


## 2-column LaTeX: word, type and definition, by CEFR

In [33]:
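# Work on a copy so that normalizing the CEFR labels does not modify df in place
data = df.copy()
data['cefr'] = data['cefr'].str.strip().str.upper()
cefrs = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
# One DataFrame per CEFR level, in the order given by cefrs
data_by_cefr = [data[data['cefr'] == c] for c in cefrs]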

In [34]:
data_by_cefr[1].head()

Unnamed: 0,word,type,cefr,phon_br,phon_n_am,definition,example,uk,us
2,ability,noun,A2,/əˈbɪləti/,/əˈbɪləti/,the fact that somebody/something is able to do...,People with the disease may lose their ability...,ability_uk.mp3,ability_us.mp3
3,able,adjective,A2,/ˈeɪbl/,/ˈeɪbl/,"to have the skill, intelligence, opportunity, ...",You must be able to speak French for this job.,able_uk.mp3,able_us.mp3
8,abroad,adverb,A2,/əˈbrɔːd/,/əˈbrɔːd/,in or to a foreign country,to go/travel/live/study abroad,abroad_uk.mp3,abroad_us.mp3
13,accept,verb,A2,/əkˈsept/,/əkˈsept/,to take willingly something that is offered; t...,He asked me to marry him and I accepted.,accept_uk.mp3,accept_us.mp3
17,accident,noun,A2,/ˈæksɪdənt/,/ˈæksɪdənt/,"an unpleasant event, especially in a vehicle, ...",a car/road/traffic accident,accident_uk.mp3,accident_us.mp3


In [35]:
import re
for data, cefr in zip(data_by_cefr, cefrs):
    if data.empty:
        continue
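    # .copy() makes an independent frame, so the assignment below does not trigger SettingWithCopyWarning
    data = data[["word", "definition", "type", "cefr"]].copy()
    data["word"] = data.apply(lambda row: f"{row.word.strip()} ({row.type.strip()})", axis=1)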
    data = data[["word", "definition"]]

    style = data.style.format(
        escape="latex",
        )
    style = style.hide(axis='index')
    style = style.hide(axis='columns')

    column_format = 'p{1.2in}p{2.3in}p{1.2in}p{2.3in}'
    latex = style.to_latex(
        environment='supertabular',
        encoding='utf8x',
        column_format=column_format
    )
    def fix_line(line):
        if re.match(r"^\\begin{supertabular}", line):
            # Append the column format to \begin{supertabular}
            return '\\begin{supertabular}'+'{'+column_format+'}'
        if re.match(r"^\\.*{tabular}", line):
            # Remove {tabular}
            return ''
        if re.match(r"^\w+\s.*\(\w+\)", line):
            return re.sub(r"(^\w+\s.*)(\(\w+\))", r"\1\\textit{\2}", line)
        return line

    latex_lines = latex.splitlines()
    latex = '\n'.join((map(fix_line, latex_lines)))

    filename = f'{DATASET}_{cefr}'
    with open(f'build/{filename}.tex', 'w', encoding='utf-8') as f:
        f.write(latex)
# Render with two_column_by_cefr.tex


## 2-column LaTeX: word, type and definition, by CEFR, shuffled

In [36]:
import re
for data, cefr in zip(data_by_cefr, cefrs):
    if data.empty:
        continue
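    # .copy() makes an independent frame, so the assignment below does not trigger SettingWithCopyWarning
    data = data[["word", "definition", "type", "cefr"]].copy()
    data["word"] = data.apply(lambda row: f"{row.word.strip()} ({row.type.strip()})", axis=1)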
    data = data[["word", "definition"]]

    data = data.sample(frac = 1)

    style = data.style.format(
        escape="latex",
        )
    style = style.hide(axis='index')
    style = style.hide(axis='columns')

    column_format = 'p{1.2in}p{2.3in}p{1.2in}p{2.3in}'
    latex = style.to_latex(
        environment='supertabular',
        encoding='utf8x',
        column_format=column_format
    )
    def fix_line(line):
        if re.match(r"^\\begin{supertabular}", line):
            # Append the column format to \begin{supertabular}
            return '\\begin{supertabular}'+'{'+column_format+'}'
        if re.match(r"^\\.*{tabular}", line):
            # Remove {tabular}
            return ''
        if re.match(r"^\w+\s.*\(\w+\)", line):
            return re.sub(r"(^\w+\s.*)(\(\w+\))", r"\1\\textit{\2}", line)
        return line

    latex_lines = latex.splitlines()
    latex = '\n'.join((map(fix_line, latex_lines)))

    filename = f'{DATASET}_shuffle_{cefr}'
    with open(f'build/{filename}.tex', 'w', encoding='utf-8') as f:
        f.write(latex)
# Render with two_column_by_cefr.tex
