# benjihillard dictionary processing

I found an English dictionary here: https://github.com/benjihillard/English-Dictionary-Database/blob/main/english%20Dictionary.csv

In [37]:
import pandas as pd
from tqdm import tqdm

In [38]:
raw = pd.read_csv("../../../RawData/Dictionaries/English/benjihillard.csv")

In [39]:
raw.sample(4)

Unnamed: 0,word,pos,def
152854,Sunshiny,a.,"Bright with the rays of the sun; clear, warm, ..."
154505,Swobber,n.,See Swabber.
148412,Staphyline,a.,Of or pertaining to the uvula or the palate.
33377,Convallaria,n.,The lily of the valley.


A problem is already visible: every headword is capitalised, yet the meaning of some words depends on capitalisation, for example Cancer (the constellation) versus cancer (the disease).

We assume the root form of a word is the uncapitalised one, unless the word is a proper noun.

In [40]:
raw[raw.word == "Cancer"]

Unnamed: 0,word,pos,def
22479,Cancer,n.,"A genus of decapod Crustacea, including some o..."
22480,Cancer,n.,The fourth of the twelve signs of the zodiac. ...
22481,Cancer,n.,A northern constellation between Gemini and Leo.
22482,Cancer,n.,"Formerly, any malignant growth, esp. one atten..."
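
The observation above can be checked directly. The following is a minimal sketch, assuming a small hypothetical stand-in frame (`sample`, with headwords copied from the outputs shown above) rather than the full `raw` frame; the names `capitalised_share` and `entries_per_word` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for `raw`: headwords copied from the outputs above.
sample = pd.DataFrame({
    "word": ["Sunshiny", "Swobber", "Staphyline", "Convallaria",
             "Cancer", "Cancer", "Cancer", "Cancer"],
})

# Share of headwords whose first character is upper case.
capitalised_share = sample["word"].str[0].str.isupper().mean()

# Number of rows (separate senses) stored under each headword.
entries_per_word = sample["word"].value_counts()

print(capitalised_share)
print(entries_per_word["Cancer"])
```

When the share is 1.0, capitalisation carries no information, so it cannot be used to tell a proper noun from a common noun that shares the same headword.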


## Conclusion
 
We'll lowercase the first letter of every headword and accept that the resulting dictionary is fallible: proper nouns such as the constellation Cancer lose their capital.

In [41]:
def lowercase_first_letter(s):
    return s[0].lower() + s[1:] if s else s
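
To make the helper's behaviour concrete, here is a small sketch comparing it with `str.lower()`. The example strings are illustrative only and are not claimed to be entries in the dictionary; `first_only` and `fully_lower` are hypothetical names.

```python
def lowercase_first_letter(s):
    # Same helper as in the cell above, repeated so this sketch runs on its own.
    return s[0].lower() + s[1:] if s else s

examples = ["Cancer", "Anglo-Saxon", ""]

# Only the first character is lowercased; the rest of the string is kept.
first_only = [lowercase_first_letter(w) for w in examples]

# For comparison: str.lower() lowercases every character.
fully_lower = [w.lower() for w in examples]

print(first_only)
print(fully_lower)
```

Capitals inside a word survive the helper, and the empty string is returned unchanged instead of raising an `IndexError`.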

In [42]:
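raw['word'] = raw['word'].apply(lowercase_first_letter)  # lowercase only the first letter of each headword
raw.drop(columns=["pos"], inplace=True)  # drop the part-of-speech column, which is not needed
raw.reset_index(inplace=True)  # expose the row index as an "index" column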

In [45]:
processed = raw.rename(columns = {"index" : "ID", "word" : "Word", "def" : "Definition"})

In [46]:
processed

Unnamed: 0,ID,Word,Definition
0,0,a,The first letter of the English and of many ot...
1,1,a,The name of the sixth tone in the model major ...
2,2,a,"An adjective, commonly called the indefinite a..."
3,3,a,"In each; to or for each; as, ""twenty leagues a..."
4,4,a,In; on; at; by.
...,...,...,...
176043,176043,yupon,Same as Yaupon.
176044,176044,yux,"See Yex, n."
176045,176045,yvel,Evil; ill.
176046,176046,ywar,Aware; wary.


In [48]:
import os

save_dir = "../../../ProcessedData/Dictionaries/English/"
os.makedirs(save_dir, exist_ok=True)  # create the output directory if it is missing

In [50]:
# Written as UTF-16, so readers must pass encoding='utf-16' when loading the file.
processed.to_csv(os.path.join(save_dir, "benjihillard.csv"), encoding='utf-16', index=False)
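
Because the file is written as UTF-16, it has to be read back with the same encoding. The sketch below illustrates the round trip under stated assumptions: `demo` is a hypothetical two-row stand-in for `processed`, and a temporary directory replaces the real `save_dir`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for `processed`, using two rows shown in the output above.
demo = pd.DataFrame({
    "ID": [4, 176045],
    "Word": ["a", "yvel"],
    "Definition": ["In; on; at; by.", "Evil; ill."],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "benjihillard.csv")
    # Same write call as above, pointed at the temporary path.
    demo.to_csv(path, encoding="utf-16", index=False)
    # The reader must be told the encoding explicitly.
    restored = pd.read_csv(path, encoding="utf-16")

# Compare column names and cell values after the round trip.
round_trip_ok = (
    list(restored.columns) == list(demo.columns)
    and restored.values.tolist() == demo.values.tolist()
)
print(round_trip_ok)
```

Any downstream notebook that loads the processed file would need the same `encoding="utf-16"` argument.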