In [1]:
import numpy as np
import pandas as pd
import re
import csv

In [2]:
# I read in my txt version of the textbook dictionary using utf-8 encoding. Each line holds a Russian word,
# its page number(s), and its English definition. The result is one long string (not yet split or lowercased).
with open('./textbook_vocab_data/novice_to_intermediate.txt', encoding='utf8') as f:
    nti = f.read()
nti # very messy ugh

'Авария 13 – traffic accident\nавтобус 12 – bus\nавтомобиль/машина 12 – car\nавтомойка 13 – car wash\nаксессуары 13, 15 – accessories\nалкогольный напиток 16 – alcoholic drink\nаллергия 16 – allergy\nальпинизм 10 – mountain climbing\nангина 16 – strep throat\nанкета 14 – application form\nапельсин 5 – orange\nарбуз 5 – watermelon\nаренда 11 – rent n.\nаспирантура 1 – graduate school\nафиша 10 – poster\nБанан 5 – banana\nбанкомат 13 – ATM\nбарабаны 10 – drums\nбаскетбол 10 – basketball\nбег 10 – running\nбежать ~ бегать 12 – to run\nImperative: (Не) Беги/те! (Не) бегай/те!\nбез + gen. 3 – without\nбезветренный 15 – windless, calm\nбелок, pl. белки 5 – protein\nбензин 13 – gasoline\nберег, prep.: на берегу 1 – сoast, shore\nбесплатный 14 – free\nбеспокоить impf. кого? 16 – to bother, disturb\nбеспокоиться impf. о ком? о чём? (за кого? за что? colloq.) 16 – to worry, be worried\nбижутерия 13 – jewelry\nблин, pl. -ы 6 – pancake\nблокнот 7 – notebook\nблондин/блондинка 2 – blond (man)/blond

In [3]:
nti_txt = nti.split('\n') # I turn the file into a list, splitting on linebreaks to divide most of the vocab items
nti_txt_low = [x.lower() for x in nti_txt] # Stress marking was encoded with capital letters, so I lower everything
ntiplay = nti_txt_low[:30] # let's take a look at the first 30 entries and use them as a play set
ntiplay

['авария 13 – traffic accident',
 'автобус 12 – bus',
 'автомобиль/машина 12 – car',
 'автомойка 13 – car wash',
 'аксессуары 13, 15 – accessories',
 'алкогольный напиток 16 – alcoholic drink',
 'аллергия 16 – allergy',
 'альпинизм 10 – mountain climbing',
 'ангина 16 – strep throat',
 'анкета 14 – application form',
 'апельсин 5 – orange',
 'арбуз 5 – watermelon',
 'аренда 11 – rent n.',
 'аспирантура 1 – graduate school',
 'афиша 10 – poster',
 'банан 5 – banana',
 'банкомат 13 – atm',
 'барабаны 10 – drums',
 'баскетбол 10 – basketball',
 'бег 10 – running',
 'бежать ~ бегать 12 – to run',
 'imperative: (не) беги/те! (не) бегай/те!',
 'без + gen. 3 – without',
 'безветренный 15 – windless, calm',
 'белок, pl. белки 5 – protein',
 'бензин 13 – gasoline',
 'берег, prep.: на берегу 1 – сoast, shore',
 'бесплатный 14 – free',
 'беспокоить impf. кого? 16 – to bother, disturb',
 'беспокоиться impf. о ком? о чём? (за кого? за что? colloq.) 16 – to worry, be worried']

It was ultimately easier to use a less coding-intensive solution to end up with a txt file that I can read into this environment. I want to keep each Russian vocabulary item and its English definition, so I'll try to get rid of some of the formatting weirdness with regular expressions. I will have to be careful with the extended definitions that continue onto multiple lines.
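One possible way to handle the multi-line entries, sketched here as a rough idea rather than a tested solution: treat any line that has no en dash separator as a continuation of the entry before it. The helper name `merge_continuations` and the `' ; '` joiner are my own placeholders, and the approach assumes that continuation lines never contain an en dash themselves.

```python
def merge_continuations(lines, sep='–', joiner=' ; '):
    """Attach lines that lack the separator to the preceding entry.

    Assumption: every real entry contains `sep` (an en dash) and
    continuation lines (e.g. the imperative forms) never do.
    """
    merged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if sep in line or not merged:
            merged.append(line)  # a new entry starts here
        else:
            merged[-1] = merged[-1] + joiner + line  # continuation line
    return merged


sample = [
    'бежать ~ бегать 12 – to run',
    'imperative: (не) беги/те! (не) бегай/те!',
    'без + gen. 3 – without',
]
merged_sample = merge_continuations(sample)
```

With the three sample lines, the imperative line ends up appended to the "бежать ~ бегать" entry instead of standing as its own row.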

In [4]:
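# First, we need to get rid of the page numbers (and any '|' characters), replacing each match with a single space.
# A raw string keeps Python from treating \s and \d as string escapes.
ntiplay = [re.sub(r'(\s\d+,|\s\d+\s|\|)', ' ', x) for x in ntiplay]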
ntiplay

['авария – traffic accident',
 'автобус – bus',
 'автомобиль/машина – car',
 'автомойка – car wash',
 'аксессуары  – accessories',
 'алкогольный напиток – alcoholic drink',
 'аллергия – allergy',
 'альпинизм – mountain climbing',
 'ангина – strep throat',
 'анкета – application form',
 'апельсин – orange',
 'арбуз – watermelon',
 'аренда – rent n.',
 'аспирантура – graduate school',
 'афиша – poster',
 'банан – banana',
 'банкомат – atm',
 'барабаны – drums',
 'баскетбол – basketball',
 'бег – running',
 'бежать ~ бегать – to run',
 'imperative: (не) беги/те! (не) бегай/те!',
 'без + gen. – without',
 'безветренный – windless, calm',
 'белок, pl. белки – protein',
 'бензин – gasoline',
 'берег, prep.: на берегу – сoast, shore',
 'бесплатный – free',
 'беспокоить impf. кого? – to bother, disturb',
 'беспокоиться impf. о ком? о чём? (за кого? за что? colloq.) – to worry, be worried']
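The substitution above leaves a doubled space in entries that listed more than one page number (see the "аксессуары" entry). A small follow-up step could collapse those runs. This is only a sketch, and `tidy_spaces` is a name I made up for it.

```python
import re


def tidy_spaces(entry):
    """Collapse any run of whitespace into one space and trim the ends."""
    return re.sub(r'\s+', ' ', entry).strip()


# Entry with the doubled space left behind by the page-number removal
cleaned = tidy_spaces('аксессуары' + '  ' + '– accessories')
```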

In [5]:
#ntiplay = [re.sub('–', ',', x) for x in ntiplay]
#ntiplay

In [6]:
df_ntiplay = pd.DataFrame(ntiplay, columns=['Entry'])
df_ntiplay

Unnamed: 0,Entry
0,авария – traffic accident
1,автобус – bus
2,автомобиль/машина – car
3,автомойка – car wash
4,аксессуары – accessories
5,алкогольный напиток – alcoholic drink
6,аллергия – allergy
7,альпинизм – mountain climbing
8,ангина – strep throat
9,анкета – application form


In [7]:
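# Split each entry on the first en dash only; recent pandas versions require n to be passed as a keyword argument
df_ntiplay.join(df_ntiplay['Entry'].str.split('–', n=1, expand=True).rename(columns={0:'Russian', 1:'English'}))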

Unnamed: 0,Entry,Russian,English
0,авария – traffic accident,авария,traffic accident
1,автобус – bus,автобус,bus
2,автомобиль/машина – car,автомобиль/машина,car
3,автомойка – car wash,автомойка,car wash
4,аксессуары – accessories,аксессуары,accessories
5,алкогольный напиток – alcoholic drink,алкогольный напиток,alcoholic drink
6,аллергия – allergy,аллергия,allergy
7,альпинизм – mountain climbing,альпинизм,mountain climbing
8,ангина – strep throat,ангина,strep throat
9,анкета – application form,анкета,application form
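Splitting on the en dash alone keeps the spaces that surrounded it, so "авария " and " traffic accident" carry stray whitespace. Below is a plain-Python sketch of the same split with both halves stripped. `split_entry` is a hypothetical helper, and returning `None` for lines without a separator is just one possible convention.

```python
def split_entry(entry, sep='–'):
    """Split an entry on the first separator and strip both halves.

    Returns (russian, english); english is None when no separator is found,
    as with the continuation lines that only hold grammatical forms.
    """
    russian, found, english = entry.partition(sep)
    if not found:
        return entry.strip(), None
    return russian.strip(), english.strip()


pairs = [split_entry(e) for e in [
    'авария – traffic accident',
    'imperative: (не) беги/те! (не) бегай/те!',
]]
```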


There are many other considerations to work through before I can pull these items out cleanly. I have to think more about what needs to be removed, because a LOT of grammatical information comes bundled with some of these items. For example, the preposition "без" is given with the case it governs (genitive). The verbs of motion "бежать ~ бегать" come with present and past conjugations and imperatives along with the definition. Splitting on the newline character to create a list was a good first step, but I might have to go low tech and work out by hand which additional info needs to be removed... frustrating
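Before going fully manual, a handful of regular expressions might catch the most common annotations. The patterns below are guesses based only on the 30 play-set entries shown above (they are not an inventory of the textbook's notation), `strip_grammar` is a hypothetical name, and the sketch assumes the headword has already been lowercased and separated from its English definition.

```python
import re

# Hypothetical patterns, derived only from the sample entries above
GRAMMAR_PATTERNS = [
    r'\([^)]*\)',                         # parenthesised notes, e.g. "(за кого? ... colloq.)"
    r'\+\s*gen\.',                        # case government, e.g. "без + gen."
    r'\bimpf\.',                          # aspect label
    r',\s*pl\.\s*\S+',                    # plural form, e.g. ", pl. белки"
    r',\s*prep\.:.*$',                    # prepositional form, e.g. ", prep.: на берегу"
    r'(?:\b[а-яё]{1,2}\s+)?\b[а-яё]+\?',  # question words, e.g. "кого?", "о ком?"
]


def strip_grammar(headword):
    """Remove the annotations matched above and tidy the leftover spaces."""
    for pattern in GRAMMAR_PATTERNS:
        headword = re.sub(pattern, ' ', headword)
    return re.sub(r'\s+', ' ', headword).strip()


examples = [
    'без + gen.',
    'белок, pl. белки',
    'берег, prep.: на берегу',
    'беспокоиться impf. о ком? о чём? (за кого? за что? colloq.)',
]
stripped = [strip_grammar(e) for e in examples]
```

Anything the patterns miss would still need a manual pass, and throwing away the case and aspect information may not be what I want in the end, so it could be worth keeping the original entry in its own column.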

In [None]:
# I'm running into more 
russ_words = pd.read_csv("./textbook_vocab_data/russian-word-list-total.csv")