# benjihillard dictionary processing

I found an English dictionary here: https://github.com/benjihillard/English-Dictionary-Database/blob/main/english%20Dictionary.csv

In [1]:
import pandas as pd
from tqdm import tqdm

In [2]:
raw = pd.read_csv("../../../RawData/Dictionaries/English/benjihillard.csv")

In [3]:
raw.sample(4)

Unnamed: 0,word,pos,def
96627,Milter,n.,A male fish.
24180,Catchwater,n.,A ditch or drain for catching water. See Catch...
87021,Lapidescence,n.,A hardening into a stone substance.
86861,Landward,adv. & a.,Toward the land.


Already we see a problem: every headword is capitalised, yet the meaning of some words depends on capitalisation, for example Cancer (the constellation) versus cancer (the disease).

We assume the root form of a word is the uncapitalised one, unless it is a proper noun.
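As a rough check of the claim that every headword is capitalised, the share of words starting with an uppercase letter can be measured. This is a minimal sketch on a toy frame built from the rows displayed in this notebook; `sample` and `share_capitalised` are illustrative names, and on the real data the same expression would be applied to `raw["word"]`.

```python
import pandas as pd

# Toy rows copied from the outputs shown in this notebook (not the full dictionary).
sample = pd.DataFrame({
    "word": ["Milter", "Catchwater", "Lapidescence", "Landward", "Cancer"],
    "pos": ["n.", "n.", "n.", "adv. & a.", "n."],
})

# Fraction of headwords whose first character is uppercase.
# A value of 1.0 means capitalisation carries no information about proper nouns.
share_capitalised = sample["word"].str[0].str.isupper().mean()
print(share_capitalised)
```

If the share is 1.0 on the full data, proper nouns cannot be separated from common nouns using the `word` column alone.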

In [4]:
raw[raw.word == "Cancer"]

Unnamed: 0,word,pos,def
22479,Cancer,n.,"A genus of decapod Crustacea, including some o..."
22480,Cancer,n.,The fourth of the twelve signs of the zodiac. ...
22481,Cancer,n.,A northern constellation between Gemini and Leo.
22482,Cancer,n.,"Formerly, any malignant growth, esp. one atten..."


In [5]:
mask = (raw['def'].str.len() > 1000)
raw.loc[mask]

Unnamed: 0,word,pos,def
127355,Reenforce,v.,That part of a cannon near the breech which is...
127356,Reenforce,v.,(b) Reenforce (v.) An additional thickness of...
139644,Shall,v. i. & auxiliary.,"As an auxiliary, shall indicates a duty or nec..."


In [6]:
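# The "Reenforce" entries are malformed; drop them for now and tidy up later if it becomes a problem.
# .copy() avoids a SettingWithCopyWarning when columns are modified below.
raw = raw[raw.word != "Reenforce"].copy()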
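The cell above removes every Reenforce row by name. An alternative, sketched here under assumptions, is to split the frame on definition length so that suspicious rows are set aside for inspection instead of being hard-coded. Note that this is a different policy from the one used in this notebook: with the same 1000-character threshold it would also set aside the long Shall entry. The helper name `split_long_definitions` and the toy data are hypothetical.

```python
import pandas as pd

def split_long_definitions(df, max_len=1000, column="def"):
    """Return (kept, flagged); flagged rows have definitions longer than max_len."""
    too_long = df[column].str.len() > max_len
    return df.loc[~too_long].copy(), df.loc[too_long].copy()

# Toy frame: the 1500-character string stands in for a malformed definition.
toy = pd.DataFrame({
    "word": ["Milter", "Reenforce", "Landward"],
    "def": ["A male fish.", "x" * 1500, "Toward the land."],
})

kept, flagged = split_long_definitions(toy)
print(flagged["word"].tolist())
print(len(kept))
```

Keeping the flagged rows in a separate frame makes it easier to come back and repair them later.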

## Conclusion
 
We'll lowercase the first letter of every headword, and accept that dictionaries can be fallible.

In [7]:
def lowercase_first_letter(s):
    return s[0].lower() + s[1:] if s else s
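The function above works on one Python string at a time and would raise a `TypeError` if it were handed a missing value (a float `NaN`). A vectorised variant using the pandas `.str` accessor is sketched below; it leaves missing values as missing. The name `lowercase_first_letter_series` and the example word "McAdam" are made up for illustration.

```python
import pandas as pd

def lowercase_first_letter_series(words):
    """Lowercase only the first character of each string; missing values stay missing."""
    return words.str[:1].str.lower() + words.str[1:]

# "McAdam" is an invented example showing that later capitals are preserved.
words = pd.Series(["Cancer", "A", "McAdam", None])
result = lowercase_first_letter_series(words)
print(result.tolist())
```

Only the first character changes, so interior capitals survive, which matches the behaviour of `lowercase_first_letter`.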

In [8]:
raw['word'] = raw['word'].apply(lowercase_first_letter) # lowercase the first letter only
raw.drop(["pos"], axis = 1, inplace = True) # drop the part-of-speech column
raw.reset_index(inplace = True) # keep the original row number as an "index" column

In [9]:
processed = raw.rename(columns = {"index" : "ID", "word" : "Word", "def" : "Definition"})

In [10]:
import os

save_dir = "../../../ProcessedData/Dictionaries/English/"
os.makedirs(save_dir, exist_ok=True)  # create the output directory if it does not exist

In [11]:
processed.to_csv(save_dir + "benjihillard.csv", encoding='utf-16', index=False)

In [12]:
df = pd.read_csv(save_dir + "benjihillard.csv", encoding='utf-16')
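One thing worth checking on the round trip: by default `pd.read_csv` converts certain strings, including "null", "nan" and "NA", to missing values. After lowercasing, a headword such as "null" would be read back as `NaN`. Whether the dictionary actually contains such an entry is an assumption here. The sketch below uses a toy frame with placeholder definitions and a temporary file to show the effect, and how `keep_default_na=False` keeps the text as written.

```python
import os
import tempfile

import pandas as pd

# Toy frame; the "Placeholder definition." strings are not real dictionary text.
processed = pd.DataFrame({
    "ID": [0, 1, 2],
    "Word": ["a", "null", "cancer"],
    "Definition": ["In; on; at; by.", "Placeholder definition.", "Placeholder definition."],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")
    processed.to_csv(path, encoding="utf-16", index=False)

    # Default parsing: "null" is treated as a missing value.
    default_read = pd.read_csv(path, encoding="utf-16")
    # keep_default_na=False: strings are kept exactly as written.
    literal_read = pd.read_csv(path, encoding="utf-16", keep_default_na=False)

print(default_read["Word"].isna().tolist())
print(literal_read["Word"].tolist())
```

On the real file, `df["Word"].isna().sum()` would show whether any headwords were lost this way.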

In [13]:
df

Unnamed: 0,ID,Word,Definition
0,0,a,The first letter of the English and of many ot...
1,1,a,The name of the sixth tone in the model major ...
2,2,a,"An adjective, commonly called the indefinite a..."
3,3,a,"In each; to or for each; as, ""twenty leagues a..."
4,4,a,In; on; at; by.
...,...,...,...
176038,176043,yupon,Same as Yaupon.
176039,176044,yux,"See Yex, n."
176040,176045,yvel,Evil; ill.
176041,176046,ywar,Aware; wary.
